Optimizing Cross-Entropy#
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from scipy.optimize import minimize
Regularized Cross-Entropy#
Consider \(N\) points \((\mathbf{x}_0,y_0),\dots,(\mathbf{x}_{N-1},y_{N-1})\) where \(\mathbf{x}_k \in \mathbb{R}^n\) for each \(k\), and \(y_k = 0\) or \(y_k = 1\) for each \(k\). The goal of logistic regression is to compute parameters \(\mathbf{W} \in \mathbb{R}^n\) and \(b \in \mathbb{R}\) which minimize the regularized cross-entropy function

\[ C(\mathbf{W},b,\alpha) = -\frac{1}{N} \sum_{k=0}^{N-1} \left( y_k \ln \left( \sigma(\mathbf{x}_k; \mathbf{W},b) \right) + (1 - y_k) \ln \left( 1 - \sigma(\mathbf{x}_k; \mathbf{W},b) \right) \right) + \alpha \Vert \mathbf{W} \Vert^2 \]

where \(\alpha > 0\) is the regularization parameter and

\[ \sigma(\mathbf{x}; \mathbf{W},b) = \frac{1}{1 + e^{-(\langle \mathbf{W} , \mathbf{x} \rangle + b)}} \]
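To get a feel for the cross-entropy term, here is a quick numerical illustration (not part of the original example): a point with \(y_k = 1\) contributes \(-\ln(\sigma(\mathbf{x}_k;\mathbf{W},b))\) to the sum, which is small when the predicted probability is close to 1 and large when it is close to 0.

# Contribution of a single point with y = 1 to the cross-entropy sum
print(-np.log(0.9))   # confident correct prediction: approximately 0.105
print(-np.log(0.1))   # confident wrong prediction: approximately 2.303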
Gradient of Cross-Entropy#
To implement the gradient descent method and find optimal parameters for logistic regression, we need the gradient of \(C(\mathbf{W},b,\alpha)\). Let’s begin with the derivatives of the logistic function \(\sigma(\mathbf{x}; \mathbf{W},b)\) where \(\mathbf{x} = (x_0,\dots,x_{n-1})\) and \(\mathbf{W} = (W_0,\dots,W_{n-1})\).
Let \(z = \langle \mathbf{W} , \mathbf{x} \rangle + b\) and write \(\sigma(z) = \sigma(\mathbf{x}; \mathbf{W},b)\). Recall

\[ \sigma'(z) = \sigma(z) \left( 1 - \sigma(z) \right) \]
Compute and simplify the partial derivative with respect to \(W_{\ell}\)

\[ \frac{\partial \sigma}{\partial W_{\ell}} = \sigma'(z) \frac{\partial z}{\partial W_{\ell}} = \sigma(z) \left( 1 - \sigma(z) \right) x_{\ell} \]

where \(x_{\ell}\) is the entry of \(\mathbf{x}\) at index \(\ell\). Therefore

\[ \nabla_{\mathbf{W}} \, \sigma = \sigma(z) \left( 1 - \sigma(z) \right) \mathbf{x} \]
Similarly

\[ \frac{\partial \sigma}{\partial b} = \sigma'(z) \frac{\partial z}{\partial b} = \sigma(z) \left( 1 - \sigma(z) \right) \]
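As an aside, we can verify the formula \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) symbolically. This quick check assumes SymPy is available and is not part of the original derivation:

import sympy as sp

z = sp.symbols('z')
sigma = 1/(1 + sp.exp(-z))
# The difference should simplify to 0
sp.simplify(sp.diff(sigma,z) - sigma*(1 - sigma))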
Compute the partial derivative of \(C(\mathbf{W},b,\alpha)\) with respect to \(W_{\ell}\)

\[ \frac{\partial C}{\partial W_{\ell}} = -\frac{1}{N} \sum_{k=0}^{N-1} \left( y_k - \sigma(\mathbf{x}_k; \mathbf{W},b) \right) x_{k,\ell} + 2 \alpha W_{\ell} \]

where \(x_{k,\ell}\) is the entry of \(\mathbf{x}_k\) at index \(\ell\). Similarly

\[ \frac{\partial C}{\partial b} = -\frac{1}{N} \sum_{k=0}^{N-1} \left( y_k - \sigma(\mathbf{x}_k; \mathbf{W},b) \right) \]
Note that we can write these derivatives in matrix notation as

\[ \nabla_{\mathbf{W}} \, C = -\frac{1}{N} X^T \left( \mathbf{y} - \boldsymbol{\sigma} \right) + 2 \alpha \mathbf{W} \hspace{20mm} \frac{\partial C}{\partial b} = -\frac{1}{N} \sum_{k=0}^{N-1} \left( y_k - \sigma(\mathbf{x}_k; \mathbf{W},b) \right) \]

where

\[ X = \begin{bmatrix} \mathbf{x}_0^T \\ \vdots \\ \mathbf{x}_{N-1}^T \end{bmatrix} \hspace{10mm} \mathbf{y} = \begin{bmatrix} y_0 \\ \vdots \\ y_{N-1} \end{bmatrix} \hspace{10mm} \boldsymbol{\sigma} = \begin{bmatrix} \sigma(\mathbf{x}_0; \mathbf{W},b) \\ \vdots \\ \sigma(\mathbf{x}_{N-1}; \mathbf{W},b) \end{bmatrix} \]
Implementation#
def gradient_descent(f,grad,x0,alpha=0.1,max_iter=100,eps=1e-4):
    # Store each iterate as a row of the matrix X
    X = np.zeros([max_iter+1,len(x0)])
    X[0,:] = x0
    for k in range(max_iter):
        xk = X[k,:]
        grad_xk = grad(xk)
        # Stop when the norm of the gradient is less than the tolerance eps
        if np.linalg.norm(grad_xk) < eps:
            # Truncate the matrix X to include
            # only rows up to current index k
            X = X[:k+1,:]
            break
        # Take a step of size alpha in the direction of steepest descent
        X[k+1,:] = xk - alpha*grad_xk
    return X
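# Quick sanity check of gradient_descent (an extra example, not from the original
# text): minimize the quadratic f(x) = x0^2 + x1^2 whose minimum is at the origin.
f_test = lambda x: np.sum(x**2)
grad_test = lambda x: 2*x
X_test = gradient_descent(f_test,grad_test,[1,-2],alpha=0.1,max_iter=1000,eps=1e-8)
X_test[-1]  # last iterate should be very close to [0,0]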
def C(W,b,X,y,alpha):
    # Regularized cross-entropy of the data X,y at parameters W,b
    N,p = X.shape
    W = np.array(W).reshape(p,1)
    y = np.array(y).reshape(N,1)
    # Logistic function applied to each row of X
    S = 1/(1 + np.exp(-(X@W + b)))
    # Average cross-entropy plus the regularization term
    E = -1/N*np.sum(y*np.log(S) + (1 - y)*np.log(1 - S))
    R = alpha*np.sum(W*W)
    return E + R
def gradC(W,b,X,y,alpha):
    # Gradient of the regularized cross-entropy with respect to W and b
    N,p = X.shape
    W = np.array(W).reshape(p,1)
    y = np.array(y).reshape(N,1)
    S = 1/(1 + np.exp(-(X@W + b)))
    dCdW = -1/N*X.T@(y - S) + 2*alpha*W
    dCdb = -1/N*np.sum(y - S)
    # Return the full gradient as a single vector [dC/dW , dC/db]
    return np.hstack([dCdW.flatten(),dCdb])
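As a check of the formulas above, we can compare gradC to a central difference approximation of the partial derivatives of C at an arbitrary point. The small random dataset Xt, yt below is made up just for this test:

rng = np.random.default_rng(0)
Xt = rng.normal(size=(6,2))
yt = rng.integers(0,2,size=6)
W0 = np.array([0.5,-0.3]); b0 = 0.1; a0 = 0.01
h = 1e-6
# Central difference approximations of dC/dW0, dC/dW1 and dC/db
approx = [(C(W0 + h*np.eye(2)[i],b0,Xt,yt,a0) - C(W0 - h*np.eye(2)[i],b0,Xt,yt,a0))/(2*h) for i in range(2)]
approx.append((C(W0,b0 + h,Xt,yt,a0) - C(W0,b0 - h,Xt,yt,a0))/(2*h))
np.allclose(approx,gradC(W0,b0,Xt,yt,a0))  # expect True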
Let’s apply gradient descent to the regularized cross-entropy function for the example below:
X = np.array([[2,3],[-1,2],[0,1],[1,-3],[4,2],[-1,-3],[3,0],[2,2],[2,-1],[-2,2],[-1,0],[0,-1]])
y = [1,1,1,0,1,0,0,1,0,1,1,0]
plt.scatter(X[:,0],X[:,1],c=y), plt.colorbar()
plt.grid(True)
plt.show()
Define the vector function \(f(\mathbf{x})\) and its gradient \(\nabla f(\mathbf{x})\) which represent \(C(\mathbf{W},b,\alpha)\) and \(\nabla C\), where \(\mathbf{x} = (W_0,\dots,W_{n-1},b)\) packs the weights and the intercept into a single vector.
alpha = 0.01
f = lambda x: C(x[:-1],x[-1],X,y,alpha)
gradf = lambda x: gradC(x[:-1],x[-1],X,y,alpha)
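As a quick check that the wrappers behave as expected, we can evaluate f and gradf at the initial guess used below, \((W_0,W_1,b) = (-1,3,1)\):

f(np.array([-1.0,3.0,1.0])), gradf(np.array([-1.0,3.0,1.0]))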
Apply gradient descent (note that the keyword argument alpha of gradient_descent is the step size, not the regularization parameter):
P = gradient_descent(f,gradf,[-1,3,1],alpha=0.01,max_iter=1000,eps=1e-10)
View the first 5 points:
P[:5]
array([[-1. , 3. , 1. ],
[-1.00005735, 2.9996237 , 0.99994556],
[-1.00011447, 2.99924761, 0.99989116],
[-1.00017135, 2.99887172, 0.9998368 ],
[-1.000228 , 2.99849603, 0.99978248]])
And the last 5 points in the sequence:
P[-5:]
array([[-0.98715577, 2.71251609, 0.95642125],
[-0.98710747, 2.7123004 , 0.95638196],
[-0.98705914, 2.71208484, 0.95634266],
[-0.98701079, 2.71186939, 0.95630337],
[-0.98696242, 2.71165406, 0.95626408]])
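The last row of P gives the weights and intercept found by gradient descent, which we can use to plot the decision boundary \(W_0 x_0 + W_1 x_1 + b = 0\) over the data (an extra plot for illustration):

W_,b_ = P[-1,:2],P[-1,2]
x0_line = np.linspace(X[:,0].min(),X[:,0].max(),100)
plt.scatter(X[:,0],X[:,1],c=y), plt.colorbar()
plt.plot(x0_line,-(W_[0]*x0_line + b_)/W_[1],'r')
plt.grid(True)
plt.show()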
Let’s compare to the result computed by LogisticRegression.
model = LogisticRegression(C=10).fit(X,y)
model.coef_
array([[-1.09701124, 3.02449763]])
model.intercept_
array([0.93960308])
The result is similar but not exactly the same. We can take a closer look at the documentation for LogisticRegression to see how it works. In particular, its parameter C is the inverse of the regularization strength and its objective function is scaled differently from ours, so we should not expect C=10 to match \(\alpha = 0.01\) exactly.
Let’s also compare to the result computed by scipy.optimize.minimize.
result = minimize(f,[-1,3,1])
result
message: Optimization terminated successfully.
success: True
status: 0
fun: 0.12531909819315135
x: [-8.221e-01 2.314e+00 7.267e-01]
nit: 13
jac: [-3.507e-06 2.081e-06 -7.134e-06]
hess_inv: [[ 5.836e+00 -4.121e+00 -6.103e+00]
[-4.121e+00 1.443e+01 2.905e+00]
[-6.103e+00 2.905e+00 2.368e+01]]
nfev: 60
njev: 15
result.x
array([-0.82214028, 2.31409617, 0.72666693])
Again the result is similar but not exactly the same. The last rows of P are still changing slightly, which suggests that gradient descent with step size 0.01 has not fully converged after 1000 iterations, whereas minimize reports a gradient close to zero at its solution.
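To confirm this, we can run gradient descent with a larger step size and more iterations (an extra experiment, not in the original text); the final iterate should move closer to the minimizer reported by minimize:

P_long = gradient_descent(f,gradf,[-1,3,1],alpha=0.1,max_iter=20000,eps=1e-6)
P_long[-1]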