Gradient Descent#
import numpy as np
import matplotlib.pyplot as plt
Directional Derivatives#
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function and let $\mathbf{u} \in \mathbb{R}^n$ be a unit vector, $\| \mathbf{u} \| = 1$. The directional derivative of $f$ at the point $\mathbf{x}$ in the direction $\mathbf{u}$ is

$$ D_{\mathbf{u}} f(\mathbf{x}) = \lim_{t \to 0} \frac{f(\mathbf{x} + t \mathbf{u}) - f(\mathbf{x})}{t} $$

The value $D_{\mathbf{u}} f(\mathbf{x})$ is the rate of change of $f$ at $\mathbf{x}$ in the direction $\mathbf{u}$.
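We can check this definition numerically by taking a small value of $t$. The sketch below uses a hypothetical example $f(x,y) = x^2 + y^2$ at the point $p = (1,2)$ in the direction $u = (1,0)$; the names f_demo, p, u and t do not appear elsewhere on this page.

f_demo = lambda x: x[0]**2 + x[1]**2   # hypothetical example function
p = np.array([1.,2.])                  # point at which to differentiate
u = np.array([1.,0.])                  # unit direction vector
t = 1e-6                               # small step for the difference quotient
print((f_demo(p + t*u) - f_demo(p))/t) # approximately 2.0

The printed value is approximately 2, which is the exact directional derivative here since $\partial f / \partial x = 2x = 2$ at $p$.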
See also
Check out Wikipedia: Directional Derivative for more information.
Gradient Vector#
The gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is the vector of its partial derivatives

$$ \nabla f = \left( \frac{\partial f}{\partial x_1} , \dots , \frac{\partial f}{\partial x_n} \right) $$

Therefore the directional derivative of $f$ at $\mathbf{x}$ in the direction $\mathbf{u}$ can be written as the dot product

$$ D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} $$

This shows that the maximum value of $D_{\mathbf{u}} f(\mathbf{x})$ over unit vectors $\mathbf{u}$ occurs when $\mathbf{u}$ points in the direction of the gradient $\nabla f(\mathbf{x})$. In other words, $f$ increases most rapidly in the direction of $\nabla f(\mathbf{x})$ and decreases most rapidly in the direction of $-\nabla f(\mathbf{x})$.
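For example, the gradient of the hypothetical f_demo above is $\nabla f_{\text{demo}} = (2x, 2y)$, and the dot product formula reproduces the finite difference value computed earlier:

grad_demo = lambda x: np.array([2*x[0],2*x[1]])  # gradient of f_demo
print(grad_demo(p) @ u)                          # exactly 2.0, matching the limit above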
See also
Check out Wikipedia: Gradient for more information.
Gradient Descent#
Suppose we want to minimize a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$. Since $f$ decreases most rapidly in the direction of $-\nabla f(\mathbf{x})$, choose an initial point $\mathbf{x}_0$ and a step size $\alpha > 0$, and compute the sequence

$$ \mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k) $$

The sequence $\{ \mathbf{x}_k \}$ converges to a local minimum of $f$ (for a suitable choice of the step size $\alpha$). This algorithm is called gradient descent.
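For example, if $f(x) = x^2$ then $\nabla f(x) = 2x$ and the update rule is $x_{k+1} = x_k - 2 \alpha x_k = (1 - 2\alpha) x_k$, so $x_k = (1 - 2\alpha)^k x_0$ converges to the minimum $x = 0$ exactly when $|1 - 2\alpha| < 1$, that is, when $0 < \alpha < 1$. The step size matters: too large and the sequence diverges, too small and convergence is slow.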
Let's apply gradient descent to the function

$$ f(x,y) = (x - 1)^2 + 2(y + 1)^2 $$

The function is a paraboloid and the global minimum occurs at $(x,y) = (1,-1)$. Plot some level curves of the function:
x = np.linspace(-2,4,50)
y = np.linspace(-4,2,50)
X,Y = np.meshgrid(x,y)
Z = (X - 1)**2 + 2*(Y + 1)**2
plt.contourf(X,Y,Z,levels=20,cmap='RdBu'), plt.colorbar()
plt.axis('equal'),plt.grid(True)
plt.show()

Compute the gradient

$$ \nabla f(x,y) = \left( 2(x - 1) , 4(y + 1) \right) $$

and compute the sequence given by gradient descent starting from $\mathbf{x}_0 = (2,1)$ with step size $\alpha = 0.1$:
x = np.linspace(-2,4,50)
y = np.linspace(-4,2,50)
X,Y = np.meshgrid(x,y)
Z = (X - 1)**2 + 2*(Y + 1)**2
plt.contourf(X,Y,Z,levels=20,cmap='RdBu'), plt.colorbar()
xk = np.array([2.,1.])
plt.plot(xk[0],xk[1],'k.')
alpha = 0.1
for k in range(100):
    # Gradient of f at the current point xk
    grad = np.array([2*xk[0] - 2,4*xk[1] + 4])
    xk = xk - alpha*grad
    plt.plot(xk[0],xk[1],'k.')
    # Stop when the gradient is close to zero
    if np.linalg.norm(grad) < 1e-4:
        break
plt.axis('equal'),plt.grid(True)
plt.show()
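The points converge to the minimum at $(1,-1)$. Since the loop above breaks once the gradient is small, we can print the final values of k and xk as a quick check (assuming the cell above has just been run so both names are still in scope):

print(k)   # loop index at which the stopping criterion was met
print(xk)  # should be very close to the minimizer (1,-1)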

Implementation#
Let's write a function called gradient_descent which takes input parameters:

- f is a function of a vector x (the function to minimize)
- grad is the gradient of f
- x0 is an initial point
- alpha is the step size
- max_iter is the maximum number of iterations
- eps is the stopping criterion

The function returns a matrix X where row k of X is the point $\mathbf{x}_k$ in the gradient descent sequence. The algorithm stops when $\| \nabla f(\mathbf{x}_k) \| < \epsilon$ or when the number of iterations reaches max_iter. Therefore the number of rows of X is determined by the number of iterations completed. Let's include default values alpha=0.1, max_iter=100 and eps=1e-4.
def gradient_descent(f,grad,x0,alpha=0.1,max_iter=100,eps=1e-4):
    X = np.zeros([max_iter+1,len(x0)])
    X[0,:] = x0
    for k in range(max_iter):
        xk = X[k,:]
        grad_xk = grad(xk)
        if np.linalg.norm(grad_xk) < eps:
            # Truncate the matrix X to include
            # only rows up to current index k
            X = X[:k+1,:]
            break
        X[k+1,:] = xk - alpha*grad_xk
    return X
Apply the function gradient_descent to the previous example.
f = lambda x: (x[0] - 1)**2 + 2*(x[1] + 1)**2
grad = lambda x: np.array([2*(x[0] - 1),4*(x[1] + 1)])
X = gradient_descent(f,grad,[2,1])
Check the first few iterations:
X[:5,:]
array([[ 2.    ,  1.    ],
       [ 1.8   ,  0.2   ],
       [ 1.64  , -0.28  ],
       [ 1.512 , -0.568 ],
       [ 1.4096, -0.7408]])
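We can verify the first step by hand: $\nabla f(2,1) = (2(2-1),4(1+1)) = (2,8)$, so $\mathbf{x}_1 = (2,1) - 0.1 \, (2,8) = (1.8,0.2)$, which matches the second row.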
Check the last few iterations:
X[-5:,:]
array([[ 1.00010634, -1.        ],
       [ 1.00008507, -1.        ],
       [ 1.00006806, -1.        ],
       [ 1.00005445, -1.        ],
       [ 1.00004356, -1.        ]])
Let's see how many points were computed in total (the initial point plus one row per iteration):
X.shape
(46, 2)
Check the norm of the gradient at the last iteration. It should be less than eps=1e-4:
np.linalg.norm(grad(X[-1,:]))
8.711228593580309e-05
And the norm of the gradient at the second-to-last iteration should be more than eps=1e-4:
np.linalg.norm(grad(X[-2,:]))
0.00010889035742372364
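Equivalently, we can compute the gradient norm along the whole sequence and confirm that it first drops below eps at the very last row (a quick check using only the X and grad defined above):

norms = np.array([np.linalg.norm(grad(X[k,:])) for k in range(len(X))])
print(np.argmax(norms < 1e-4))  # index of the first small-gradient point: len(X)-1 = 45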
It works as expected!
Example#
Consider the function

$$ f(x,y) = 5x^2 + 4y^2 + x^4 + 2y^4 - 10x^2 y $$

Plot some level curves of the function:
x = np.linspace(-4,4,50)
y = np.linspace(-2,3,50)
X,Y = np.meshgrid(x,y)
Z = 5*X**2 + 4*Y**2 + X**4 + 2*Y**4 - 10*X**2*Y
plt.contourf(X,Y,Z,levels=[-10,-5,-2,-1,0,1,2,5,10],cmap='RdBu'), plt.colorbar()
plt.grid(True)
plt.show()

We see 5 critical points. Let's approximate the local minimum near $(2.6,1.9)$. Apply gradient descent to $f$ with initial point $\mathbf{x}_0 = (2.5,1.8)$ and step size $\alpha = 0.01$:
f = lambda x: 5*x[0]**2 + 4*x[1]**2 + x[0]**4 + 2*x[1]**4 - 10*x[0]**2*x[1]
grad = lambda x: np.array([10*x[0] + 4*x[0]**3 - 20*x[0]*x[1], 8*x[1] + 8*x[1]**3 - 10*x[0]**2])
X = gradient_descent(f,grad,[2.5,1.8],0.01)
Inspect the first 5 points:
X[:5,:]
array([[2.5       , 1.8       ],
       [2.525     , 1.81444   ],
       [2.54485407, 1.82896844],
       [2.56201249, 1.84082887],
       [2.57638398, 1.85091924]])
And the last 5 points:
X[-5:,:]
array([[2.64421542, 1.89837822],
       [2.6442171 , 1.89837939],
       [2.64421847, 1.89838035],
       [2.64421957, 1.89838112],
       [2.64422047, 1.89838175]])
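The sequence settles near $(2.644,1.898)$. As a sanity check (using only the f and grad defined above), verify that the gradient is nearly zero at the last point and inspect the function value there:

print(np.linalg.norm(grad(X[-1,:])))  # should be small
print(f(X[-1,:]))                     # value of f at the approximate local minimum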
Plot all the points in the sequence to see it converge to the local minimum:
x = np.linspace(2,3,50)
y = np.linspace(1.5,2.5,50)
Xs,Ys = np.meshgrid(x,y)
Zs = 5*Xs**2 + 4*Ys**2 + Xs**4 + 2*Ys**4 - 10*Xs**2*Ys
plt.contourf(Xs,Ys,Zs,levels=20,cmap='RdBu'), plt.colorbar()
plt.plot(X[:,0],X[:,1],'k.',alpha=0.5)
plt.grid(True)
plt.show()

See also
Check out Wikipedia: Gradient Descent for more information.