Introduction to Convex Optimization for Machine
Learning
John Duchi
University of California, Berkeley
Practical Machine Learning, Fall 2009
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 1 / 53
Outline
What is Optimization
Convex Sets
Convex Functions
Convex Optimization Problems
Lagrange Duality
Optimization Algorithms
Take Home Messages
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 2 / 53
What is Optimization
What is Optimization (and why do we care?)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 3 / 53
What is Optimization
What is Optimization?
◮ Finding the minimizer of a function subject to constraints:
minimize
x
f0(x)
. fi(x) ≤ 0, i = {1, . . . , k}
hj(x) = 0, j = {1, . . . , l}
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 4 / 53
What is Optimization
What is Optimization?
◮ Finding the minimizer of a function subject to constraints:
minimize
x
f0(x)
. fi(x) ≤ 0, i = {1, . . . , k}
hj(x) = 0, j = {1, . . . , l}
◮ Example: Stock market. “Minimize variance of return subject to
getting at least $50.”
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 4 / 53
What is Optimization
Why do we care?
Optimization is at the heart of many (most practical?) machine learning
algorithms.
◮ Linear regression:
minimize
w
‖Xw − y‖2
◮ Classification (logistic regresion or SVM):
minimize
w
n∑
i=1
log
(
1 + exp(−yixTi w)
)
or ‖w‖2 + C
n∑
i=1
ξi . ξi ≥ 1− yixTi w, ξi ≥ 0.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 5 / 53
What is Optimization
We still care...
◮ Maximum likelihood estimation:
maximize
θ
n∑
i=1
log pθ(xi)
◮ Collaborative filtering:
minimize
w
∑
i≺j
log
(
1 + exp(wTxi − wTxj)
)
◮ k-means:
minimize
µ1,...,µk
J(µ) =
k∑
j=1
∑
i∈Cj
‖xi − µj‖2
◮ And more (graphical models, feature selection, active learning,
control)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 6 / 53
What is Optimization
But generally speaking...
We’re screwed.
◮ Local (non global) minima of f0
◮ All kinds of constraints (even restricting to continuous functions):
h(x) = sin(2πx) = 0
−3
−2
−1
0
1
2
3
−3
−2
−1
0
1
2
3
−50
0
50
100
150
200
250
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 7 / 53
What is Optimization
But generally speaking...
We’re screwed.
◮ Local (non global) minima of f0
◮ All kinds of constraints (even restricting to continuous functions):
h(x) = sin(2πx) = 0
−3
−2
−1
0
1
2
3
−3
−2
−1
0
1
2
3
−50
0
50
100
150
200
250
◮ Go for convex problems!
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 7 / 53
Convex Sets
Convex Sets
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 8 / 53
Convex Sets
Definition
A set C ⊆ Rn is convex if for x, y ∈ C and any α ∈ [0, 1],
αx+ (1− α)y ∈ C.
x
y
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 9 / 53
Convex Sets
Examples
◮ All of Rn (obvious)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 10 / 53
Convex Sets
Examples
◮ All of Rn (obvious)
◮ Non-negative orthant, Rn+: let x � 0, y � 0, clearly
αx+ (1− α)y � 0.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 10 / 53
Convex Sets
Examples
◮ All of Rn (obvious)
◮ Non-negative orthant, Rn+: let x � 0, y � 0, clearly
αx+ (1− α)y � 0.
◮ Norm balls: let ‖x‖ ≤ 1, ‖y‖ ≤ 1, then
‖αx+ (1− α)y‖ ≤ ‖αx‖+ ‖(1− α)y‖ = α ‖x‖+ (1− α) ‖y‖ ≤ 1.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 10 / 53
Convex Sets
Examples
◮ Affine subspaces: Ax = b, Ay = b, then
A(αx+ (1− α)y) = αAx+ (1− α)Ay = αb+ (1− α)b = b.
0
1
0
1
−
−
0
1
x1x2
x
3
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 11 / 53
Convex Sets
More examples
◮ Arbitrary intersections of convex sets: let Ci be convex for i ∈ I,
C =
⋂
iCi, then
x ∈ C, y ∈ C ⇒ αx+ (1− α)y ∈ Ci ∀ i ∈ I
so αx+ (1− α)y ∈ C.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 12 / 53
Convex Sets
More examples
◮ PSD Matrices, . the
positive semidefinite cone
S
n
+ ⊂ Rn×n. A ∈ Sn+ means
xTAx ≥ 0 for all x ∈ Rn. For
A,B ∈ S+n ,
xT (αA+ (1− α)B)x
= αxTAx+ (1− α)xTBx ≥ 0.
◮ On right:
S
2
+ =
{[
x z
z y
]
� 0
}
=
{
x, y, z : x ≥ 0, y ≥ 0, xy ≥ z2}
0
1
−1
−
0
1
0
1
xy
z
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 13 / 53
Convex Functions
Convex Functions
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 14 / 53
Convex Functions
Definition
A function f : Rn → R is convex if for x, y ∈ dom f and any α ∈ [0, 1],
f(αx+ (1− α)y) ≤ αf(x) + (1− α)f(y).
f(x)
f(y)
αf(x) + (1 - α)f(y)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 15 / 53
Convex Functions
First order convexity conditions
Theorem
Suppose f : Rn → R is differentiable. Then f is convex if and only if for
all x, y ∈ dom f
f(y) ≥ f(x) +∇f(x)T (y − x)
f(y)
(x, f(x))
f(x) +∇f(x)T (y - x)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 16 / 53
Convex Functions
Actually, more general than that
Definition
The subgradient set, or subdifferential set, ∂f(x) of f at x is
∂f(x) =
{
g : f(y) ≥ f(x) + gT (y − x) for all y} .
Theorem
f : Rn → R is convex if and
only if it has non-empty
subdifferential set everywhere.
f(y)
f(x) + gT (y - x)
(x, f(x))
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 17 / 53
Convex Functions
Second order convexity conditions
Theorem
Suppose f : Rn → R is twice differentiable. Then f is convex if and only if
for all x ∈ dom f ,
∇2f(x) � 0.
−2
−1
0
1
2
−2
−1
0
1
2
0
2
4
6
8
10
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 18 / 53
Convex Functions
Convex sets and convex functions
Definition
The epigraph of a function f is the
set of points
epi f = {(x, t) : f(x) ≤ t}.
◮ epi f is convex if and only if f is
convex.
◮ Sublevel sets, {x : f(x) ≤ a}
are convex for convex f .
epi f
a
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 19 / 53
Convex Functions Examples
Examples
◮ Linear/affine functions:
f(x) = bTx+ c.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 20 / 53
Convex Functions Examples
Examples
◮ Linear/affine functions:
f(x) = bTx+ c.
◮ Quadratic functions:
f(x) =
1
2
xTAx+ bTx+ c
for A � 0. For regression:
1
2
‖Xw − y‖2 = 1
2
wTXTXw − yTXw + 1
2
yT y.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 20 / 53
Convex Functions Examples
More examples
◮ Norms (like ℓ1 or ℓ2 for regularization):
‖αx+ (1− α)y‖ ≤ ‖αx‖+ ‖(1− α)y‖ = α ‖x‖+ (1− α) ‖y‖ .
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 21 / 53
Convex Functions Examples
More examples
◮ Norms (like ℓ1 or ℓ2 for regularization):
‖αx+ (1− α)y‖ ≤ ‖αx‖+ ‖(1− α)y‖ = α ‖x‖+ (1− α) ‖y‖ .
◮ Composition with an affine function f(Ax+ b):
f(A(αx+ (1− α)y) + b) = f(α(Ax+ b) + (1− α)(Ay + b))
≤ αf(Ax+ b) + (1− α)f(Ay + b)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 21 / 53
Convex Functions Examples
More examples
◮ Norms (like ℓ1 or ℓ2 for regularization):
‖αx+ (1− α)y‖ ≤ ‖αx‖+ ‖(1− α)y‖ = α ‖x‖+ (1− α) ‖y‖ .
◮ Composition with an affine function f(Ax+ b):
f(A(αx+ (1− α)y) + b) = f(α(Ax+ b) + (1− α)(Ay + b))
≤ αf(Ax+ b) + (1− α)f(Ay + b)
◮ Log-sum-exp (via ∇2f(x) PSD):
f(x) = log
(
n∑
i=1
exp(xi)
)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 21 / 53
Convex Functions Examples
Important examples in Machine Learning
◮ SVM loss:
f(w) =
[
1− yixTi w
]
+
◮ Binary logistic loss:
f(w) = log
(
1 + exp(−yixTi w)
)
−2 3
0
3
[1 - x]+
log(1 + ex)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 22 / 53
Convex Optimization Problems
Convex Optimization Problems
Definition
An optimization problem is convex if its objective is a convex function, the
inequality constraints fj are convex, and the equality constraints hj are
affine
minimize
x
f0(x) (Convex function)
. fi(x) ≤ 0 (Convex sets)
hj(x) = 0 (Affine)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 23 / 53
Convex Optimization Problems
It’s nice to be convex
Theorem
If xˆ is a local minimizer of a convex optimization problem, it is a global
minimizer.
0 1 2 3
1
2
3
4
x∗
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 24 / 53
Convex Optimization Problems
Even more reasons to be convex
Theorem
∇f(x) = 0 if and only if x is a global minimizer of f(x).
Proof.
◮ ∇f(x) = 0. We have
f(y) ≥ f(x) +∇f(x)T (y − x) = f(x).
◮ ∇f(x) 6= 0. There is a direction of descent.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 25 / 53
Convex Optimization Problems
LET’S TAKE A BREAK
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 26 / 53
Lagrange Duality
Lagrange Duality
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 27 / 53
Lagrange Duality
Goals of Lagrange Duality
◮ Get certificate for optimality of a problem
◮ Remove constraints
◮ Reformulate problem
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 28 / 53
Lagrange Duality
Constructing the dual
◮ Start with optimization problem:
minimize
x
f0(x)
. fi(x) ≤ 0, i = {1, . . . , k}
hj(x) = 0, j = {1, . . . , l}
◮ Form Lagrangian using Lagrange multipliers λi ≥ 0, νi ∈ R
L(x, λ, ν) = f0(x) +
k∑
i=1
λifi(x) +
l∑
j=1
νjhj(x)
◮ Form dual function
g(λ, ν) = inf
x
L(x, λ, ν) = inf
x
f0(x) +
k∑
i=1
λifi(x) +
l∑
j=1
νjhj(x)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 29 / 53
Lagrange Duality
Remarks
◮ Original problem is equivalent to
minimize
x
[
sup
λ�0,ν
L(x, λ, ν)
]
◮ Dual problem is switching the min and max:
maximize
λ�0,ν
[
inf
x
L(x, λ, ν)
]
.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 30 / 53
Lagrange Duality
One Great Property of Dual
Lemma (Weak Duality)
If λ � 0, then
g(λ, ν) ≤ f0(x∗).
Proof.
We have
g(λ, ν) = inf
x
L(x, λ, ν) ≤ L(x∗, λ, ν)
= f0(x
∗) +
k∑
i=1
λifi(x
∗) +
l∑
j=1
νjhj(x
∗) ≤ f0(x∗).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 31 / 53
Lagrange Duality
The Greatest Property of the Dual
Theorem
For reasonable1 convex problems,
sup
λ�0,ν
g(λ, ν) = f0(x
∗)
1There are conditions called constraint qualification for which this is true
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 32 / 53
Lagrange Duality
Geometric Look
Minimize 12(x− c− 1)2 subject to x2 ≤ c.
−2 − −1 − 0 1 2
−2
0
2
4
6
8
10
12
14
x
0 1 2
−
−
0
λ
True function (blue), constraint
(green), L(x, λ) for different λ
(dotted)
Dual function g(λ) (black), primal op-
timal (dotted blue)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 33 / 53
Lagrange Duality
Intuition
Can interpret duality as linear approximation.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 34 / 53
Lagrange Duality
Intuition
Can interpret duality as linear approximation.
◮ I−(a) =∞ if a > 0, 0 otherwise; I0(a) =∞ unless a = 0. Rewrite
problem as
minimize
x
f0(x) +
k∑
i=1
I−(fi(x)) +
l∑
j=1
I0(hj(x))
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 34 / 53
Lagrange Duality
Intuition
Can interpret duality as linear approximation.
◮ I−(a) =∞ if a > 0, 0 otherwise; I0(a) =∞ unless a = 0. Rewrite
problem as
minimize
x
f0(x) +
k∑
i=1
I−(fi(x)) +
l∑
j=1
I0(hj(x))
◮ Replace I(fi(x)) with λifi(x); a measure of “displeasure” when
λi ≥ 0, fi(x) > 0. νihj(x) lower bounds I0(hj(x)):
minimize
x
f0(x) +
k∑
i=1
λifi(x) +
l∑
j=1
νjhj(x)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 34 / 53
Lagrange Duality
Example: Linearly constrained least squares
minimize
x
1
2
‖Ax− b‖2 . Bx = d.
Form the Lagrangian:
L(x, ν) = 1
2
‖Ax− b‖2 + νT (Bx− d)
Take infimum:
∇xL(x, ν) = ATAx−AT b+BT ν ⇒ x = (ATA)−1(AT b−BT ν)
Simple unconstrained quadratic problem!
inf
x
L(x, ν)
=
1
2
∥∥A(ATA)−1(AT b−BT ν)− b∥∥2 + νTB((ATA)−1AT b−BT ν)− dT ν
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 35 / 53
Lagrange Duality
Example: Quadratically constrained least squares
minimize
x
1
2
‖Ax− b‖2 . 1
2
‖x‖2 ≤ c.
Form the Lagrangian (λ ≥ 0):
L(x, λ) = 1
2
‖Ax− b‖2 + 1
2
λ(‖x‖2 − 2c)
Take infimum:
∇xL(x, ν) = ATAx−AT b+ λI ⇒ x = (ATA+ λI)−1AT b
inf
x
L(x, λ) = 1
2
∥∥A(ATA+ λI)−1AT b− b∥∥2+λ
2
∥∥(ATA+ λI)−1AT b∥∥2−λc
One variable dual problem!
g(λ) = −1
2
bTA(ATA+ λI)−1AT b− λc+ 1
2
‖b‖2 .
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 36 / 53
Lagrange Duality
Uses of the Dual
◮ Main use: certificate of optimality (. duality gap). If we have
feasible x and know the dual g(λ, ν), then
g(λ, ν) ≤ f0(x∗) ≤ f0(x) ⇒ f0(x∗)− f0(x) ≥ g(λ, ν)− f0(x)
⇒ f0(x)− f0(x∗) ≤ f0(x)− g(λ, ν).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 37 / 53
Lagrange Duality
Uses of the Dual
◮ Main use: certificate of optimality (. duality gap). If we have
feasible x and know the dual g(λ, ν), then
g(λ, ν) ≤ f0(x∗) ≤ f0(x) ⇒ f0(x∗)− f0(x) ≥ g(λ, ν)− f0(x)
⇒ f0(x)− f0(x∗) ≤ f0(x)− g(λ, ν).
◮ Also used in more advanced primal-dual algorithms (we won’t talk
about these).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 37 / 53
Optimization Algorithms
Optimization Algorithms
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 38 / 53
Optimization Algorithms Gradient Methods
Gradient Descent
The simplest algorithm in the world (almost). Goal:
minimize
x
f(x)
Just iterate
xt+1 = xt − ηt∇f(xt)
where ηt is stepsize.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 39 / 53
Optimization Algorithms Gradient Methods
Single Step Illustration
f(xt) - η∇f(xt)T (x - xt)
f(x)
(xt, f(xt))
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 40 / 53
Optimization Algorithms Gradient Methods
Full Gradient Descent
f(x) = log(exp(x1 + 3x2 − .1) + exp(x1 − 3x2 − .1) + exp(−x1 − .1))
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 41 / 53
Optimization Algorithms Gradient Methods
Stepsize Selection
How do I choose a stepsize?
◮ Idea 1: exact line search
ηt = argmin
η
f (x− η∇f(x))
Too expensive to be practical.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 42 / 53
Optimization Algorithms Gradient Methods
Stepsize Selection
How do I choose a stepsize?
◮ Idea 1: exact line search
ηt = argmin
η
f (x− η∇f(x))
Too expensive to be practical.
◮ Idea 2: backtracking (Armijo) line search. Let α ∈ (0, 12), β ∈ (0, 1).
Multiply η = βη until
f (x− η∇f(x)) ≤ f(x)− αη ‖∇f(x)‖2
Works well in practice.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 42 / 53
Optimization Algorithms Gradient Methods
Illustration of Armijo/Backtracking Line Search
ηt = 0 η0
η
f(x) - ηα‖∇f(x)‖2
f(x - η∇f(x))
As a function of stepsize η. Clearly a region where f(x− η∇f(x)) is
below line f(x)− αη ‖∇f(x)‖2.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 43 / 53
Optimization Algorithms Gradient Methods
Newton’s method
Idea: use a second-order approximation to function.
f(x+∆x) ≈ f(x) +∇f(x)T∆x+ 1
2
∆xT∇2f(x)∆x
Choose ∆x to minimize above:
∆x = − [∇2f(x)]−1∇f(x)
This is descent direction:
∇f(x)T∆x = −∇f(x)T [∇2f(x)]−1∇f(x) < 0.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 44 / 53
Optimization Algorithms Gradient Methods
Newton step picture
(x, f(x))
(x + ∆x, f(x + ∆x))
f
fˆ
fˆ is 2nd-order approximation, f is true function.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 45 / 53
Optimization Algorithms Gradient Methods
Convergence of gradient descent and Newton’s method
◮ Strongly convex case: ∇2f(x) � mI, then “Linear convergence.” For
some γ ∈ (0, 1), f(xt)− f(x∗) ≤ γt, γ < 1.
f(xt)− f(x∗) ≤ γt or t ≥ 1
γ
log
1
ε
⇒ f(xt)− f(x∗) ≤ ε.
◮ Smooth case: ‖∇f(x)−∇f(y)‖ ≤ C ‖x− y‖.
f(xt)− f(x∗) ≤ K
t2
◮ Newton’s method often is faster, especially when f has “long valleys”
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 46 / 53
Optimization Algorithms Gradient Methods
What about constraints?
◮ Linear constraints Ax = b are easy. For example, in Newton method
(assume Ax = b):
minimize
∆x
∇f(x)T∆x+ 1
2
∆xT∇2f(x)∆x . A∆x = 0.
Solution ∆x satisfies A(x+∆x) = Ax+A∆x = b.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 47 / 53
Optimization Algorithms Gradient Methods
What about constraints?
◮ Linear constraints Ax = b are easy. For example, in Newton method
(assume Ax = b):
minimize
∆x
∇f(x)T∆x+ 1
2
∆xT∇2f(x)∆x . A∆x = 0.
Solution ∆x satisfies A(x+∆x) = Ax+A∆x = b.
◮ Inequality constraints are a bit tougher...
minimize
∆x
∇f(x)T∆x+ 1
2
∆xT∇2f(x)∆x . fi(x+∆x) ≤ 0
just as hard as original.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 47 / 53
Optimization Algorithms Gradient Methods
Logarithmic Barrier Methods
Goal:
minimize
x
f0(x) . fi(x) ≤ 0, i = {1, . . . , k}
Convert to
minimize
x
f0(x) +
k∑
i=1
I−(fi(x))
Approximate I−(u) ≈ −t log(−u) for small t.
minimize
x
f0(x)− t
k∑
i=1
log (−fi(x))
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 48 / 53
Optimization Algorithms Gradient Methods
The barrier function
−3 0 1
−2
0
10
u
I−(u) is dotted line, others are −t log(−u).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 49 / 53
Optimization Algorithms Gradient Methods
Illustration
Minimizing cTx subject to Ax ≤ b.
c
t = 5
t = 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 50 / 53
Optimization Algorithms Subgradient Methods
Subgradient Descent
Really, the simplest algorithm in the world. Goal:
minimize
x
f(x)
Just iterate
xt+1 = xt − ηtgt
where ηt is a stepsize, gt ∈ ∂f(xt).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 51 / 53
Optimization Algorithms Subgradient Methods
Why subgradient descent?
◮ Lots of non-differentiable convex functions used in machine learning:
f(x) =
[
1− aTx]
+
, f(x) = ‖x‖1 , f(X) =
k∑
r=1
σr(X)
where σr is the rth singular value of X.
◮ Easy to analyze
◮ Do not even need true sub-gradient: just have Egt ∈ ∂f(xt).
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 52 / 53
Optimization Algorithms Subgradient Methods
Proof of convergence for subgradient descent
Idea: bound ‖xt+1 − x∗‖ using subgradient inequality. Assume that
‖gt‖ ≤ G.
‖xt+1 − x∗‖2 = ‖xt − ηgt − x∗‖2
= ‖xt − x∗‖2 − 2ηgTt (xt − x∗) + η2 ‖gt‖2
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 53 / 53
Optimization Algorithms Subgradient Methods
Proof of convergence for subgradient descent
Idea: bound ‖xt+1 − x∗‖ using subgradient inequality. Assume that
‖gt‖ ≤ G.
‖xt+1 − x∗‖2 = ‖xt − ηgt − x∗‖2
= ‖xt − x∗‖2 − 2ηgTt (xt − x∗) + η2 ‖gt‖2
Recall that
f(x∗) ≥ f(xt) + gTt (x∗ − xt) ⇒ − gTt (xt − x∗) ≤ f(x∗)− f(xt)
so
‖xt+1 − x∗‖2 ≤ ‖xt − x∗‖2 + 2η [f(x∗)− f(xt)] + η2G2.
Then
f(xt)− f(x∗) ≤ ‖xt − x
∗‖2 − ‖xt+1 − x∗‖2
2η
+
η
2
G2.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 53 / 53
Optimization Algorithms Subgradient Methods
Almost done...
Sum from t = 1 to T :
T∑
t=1
f(xt)− f(x∗) ≤ 1
2η
T∑
t=1
[
‖xt − x∗‖2 − ‖xt+1 − x∗‖2
]
+
Tη
2
G2
=
1
2η
‖x1 − x∗‖2 − 1
2η
‖xT+1 − x∗‖2 + Tη
2
G2
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 54 / 53
Optimization Algorithms Subgradient Methods
Almost done...
Sum from t = 1 to T :
T∑
t=1
f(xt)− f(x∗) ≤ 1
2η
T∑
t=1
[
‖xt − x∗‖2 − ‖xt+1 − x∗‖2
]
+
Tη
2
G2
=
1
2η
‖x1 − x∗‖2 − 1
2η
‖xT+1 − x∗‖2 + Tη
2
G2
Now let D = ‖x1 − x∗‖, and keep track of min along run,
f (xbest)− f(x∗) ≤ 1
2ηT
D2 +
η
2
G2.
Set η = D
G
√
T
and
f (xbest)− f(x∗) ≤ DG√
T
.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 54 / 53
Optimization Algorithms Subgradient Methods
Extension: projected subgradient descent
Now have a convex constraint set X.
Goal:
minimize
x∈X
f(x)
Idea: do subgradient steps, project
xt back into X at every iteration.
xt+1 = ΠX(xt − ηgt)
Proof:
‖ΠX(xt)− x∗‖ ≤ ‖xt − x∗‖
if x∗ ∈ X.
X
x∗
xt
ΠX(xt)
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 55 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Projected subgradient example
minimize
x
1
2
‖Ax− b‖ . ‖x‖1 ≤ 1
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 56 / 53
Optimization Algorithms Subgradient Methods
Convergence results for (projected) subgradient methods
◮ Any decreasing, non-summable stepsize ηt → 0,
∑∞
t=1 ηt =∞ gives
f
(
xavg(t)
)− f(x∗)→ 0.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 57 / 53
Optimization Algorithms Subgradient Methods
Convergence results for (projected) subgradient methods
◮ Any decreasing, non-summable stepsize ηt → 0,
∑∞
t=1 ηt =∞ gives
f
(
xavg(t)
)− f(x∗)→ 0.
◮ Slightly less brain-dead analysis than earlier shows with ηt ∝ 1/
√
t
f
(
xavg(t)
)− f(x∗) ≤ C√
t
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 57 / 53
Optimization Algorithms Subgradient Methods
Convergence results for (projected) subgradient methods
◮ Any decreasing, non-summable stepsize ηt → 0,
∑∞
t=1 ηt =∞ gives
f
(
xavg(t)
)− f(x∗)→ 0.
◮ Slightly less brain-dead analysis than earlier shows with ηt ∝ 1/
√
t
f
(
xavg(t)
)− f(x∗) ≤ C√
t
◮ Same convergence when gt is random, . Egt ∈ ∂f(xt). Example:
f(w) =
1
2
‖w‖2 + C
n∑
i=1
[
1− yixTi w
]
+
Just pick random training example.
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 57 / 53
Take Home Messages
Recap
◮ Defined convex sets and functions
◮ Saw why we want optimization problems to be convex (solvable)
◮ Sketched some of Lagrange duality
◮ First order methods are easy and (often) work well
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 58 / 53
Take Home Messages
Take Home Messages
◮ Many useful problems formulated as convex optimization problems
◮ If it is not convex and not an eigenvalue problem, you are out of luck
◮ If it is convex, you are golden
Duchi (UC Berkeley) Convex Optimization for Machine Learning Fall 2009 59 / 53