Least Squares#

import numpy as np
import matplotlib.pyplot as plt
import scipy.linalg as la
import pandas as pd

Residual Sum of Squares#

Let \((\mathbf{x}_1,y_1),\dots,(\mathbf{x}_N,y_N)\) be a data set of \(N\) observations where each \(\mathbf{x}_i = (x_{i,1},\dots,x_{i,p})\) is a vector of values of the independent variables \(x_1,\dots,x_p\). Consider a linear regression model

\[ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon \]

The residual sum of squares is the cost function

\[ RSS = \sum_{i=1}^N |y_i - (\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_p x_{i,p}) |^2 \]

The goal of least squares linear regression is to compute parameters \(\beta_0,\beta_1,\dots,\beta_p\) which minimize \(RSS\).
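For example, we can evaluate \(RSS\) directly for any candidate values of the parameters. The minimal sketch below uses made-up data and parameter values chosen purely for illustration:

# Evaluate RSS for candidate parameters (made-up data for illustration)
x_toy = np.array([1,2,3,4])
y_toy = np.array([2.1,3.9,6.2,7.8])
b0,b1 = 0.0,2.0
np.sum((y_toy - (b0 + b1*x_toy))**2)

Least squares regression finds the parameter values which make this quantity as small as possible.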

Normal Equations#

Note that the expression for \(RSS\) above can be written in vector form using matrix multiplication

\[ RSS = || \mathbf{y} - X \boldsymbol{\beta} ||^2 \]

where

\[\begin{split} \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \hspace{20mm} X = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,p} \\ 1 & x_{2,1} & \cdots & x_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{N,1} & \cdots & x_{N,p} \end{bmatrix} \hspace{20mm} \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \end{split}\]

If the columns of \(X\) are linearly independent (so that \(X^T X\) is invertible), then the vector \(\boldsymbol{\beta}\) which minimizes \(RSS\) is the unique solution of the normal equations

\[ X^T X \boldsymbol{\beta} = X^T \mathbf{y} \]

Let’s see why this is true. Note that \(X \boldsymbol{\beta}\) is a vector in the column space of \(X\), so the distance \(|| \mathbf{y} - X \boldsymbol{\beta} ||\) is minimized when the line from \(\mathbf{y}\) to \(X\boldsymbol{\beta}\) is orthogonal to the column space of \(X\). In other words, we want to find \(\boldsymbol{\beta}\) such that \(\mathbf{y} - X \boldsymbol{\beta}\) is in the nullspace of \(X^T\). This leads to the equation \(X^T(\mathbf{y} - X \boldsymbol{\beta}) = 0\) and finally \(X^T X \boldsymbol{\beta} = X^T \mathbf{y}\).
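As a quick numerical check (our addition, using randomly generated data with an arbitrary size and seed), solving the normal equations should agree with a general-purpose least squares solver such as scipy.linalg.lstsq:

# Compare the normal equations with scipy.linalg.lstsq on random data
rng = np.random.default_rng(0)                     # arbitrary seed
A = np.column_stack([np.ones(50),rng.random(50)])  # 50 random observations
b = rng.random(50)
np.allclose(la.solve(A.T@A,A.T@b),la.lstsq(A,b)[0])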

Example: Real Estate Prices#

The data below was collected from REALTOR.ca for 2-bedroom apartments for sale in the Kitsilano neighborhood of Vancouver. The dependent variable \(y\) is the listing price and the independent variable \(x\) is the apartment size (in square feet).

x = [1557,1452,767,900,1018,802,924,981,879,819,829,1088,1085]
y = [1.398,1.750,0.899,0.848,1.149,0.825,0.999888,1.175,0.885,0.749888,1.098,1.099,1.198]
plt.plot(x,y,'.')
plt.title('Listing Price on REALTOR.ca')
plt.xlabel('Apartment Size (sqft)'), plt.ylabel('Price (Millions $)'), plt.grid(True)
plt.show()
[Figure: scatter plot of listing price vs. apartment size]

Choose a simple linear regression model for the data \(y = \beta_0 + \beta_1 x + \varepsilon\). Construct the matrix \(X\):

X = np.column_stack([np.ones(len(x)),x])
X
array([[1.000e+00, 1.557e+03],
       [1.000e+00, 1.452e+03],
       [1.000e+00, 7.670e+02],
       [1.000e+00, 9.000e+02],
       [1.000e+00, 1.018e+03],
       [1.000e+00, 8.020e+02],
       [1.000e+00, 9.240e+02],
       [1.000e+00, 9.810e+02],
       [1.000e+00, 8.790e+02],
       [1.000e+00, 8.190e+02],
       [1.000e+00, 8.290e+02],
       [1.000e+00, 1.088e+03],
       [1.000e+00, 1.085e+03]])

Solve the normal equations:

beta = la.solve(X.T@X,X.T@y)
beta
array([0.10620723, 0.00096886])
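As a quick cross-check (our addition), scipy.linalg.lstsq solves the same least squares problem and should return the same coefficients:

# Cross-check the normal equations solution with SciPy's solver
la.lstsq(X,y)[0]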

Plot the linear regression model with the data:

xs = np.linspace(700,1600)
ys = beta[0] + beta[1]*xs
plt.plot(x,y,'.',xs,ys)
plt.title('Listing Price on REALTOR.ca'), plt.grid(True)
plt.xlabel('Apartment Size (sqft)'), plt.ylabel('Price (Millions $)')
plt.show()
[Figure: listing price data with fitted regression line]

The coefficient \(\beta_1 = 0.00096886\) is in units of millions of dollars per square foot, which suggests that the listing price of a 2-bedroom apartment increases by about \(968.86\) dollars per additional square foot.
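For example, the model predicts the listing price of a 1000 square foot apartment (a size chosen just for illustration) as follows:

# Predicted listing price (in millions of dollars) for a 1000 sqft apartment
size = 1000
beta[0] + beta[1]*size

Given the coefficients above, this works out to roughly \(1.075\) million dollars.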

Example: Temperature#

The file temperature.csv includes the daily average temperature measured at the Vancouver Airport from 1995 to 2023 (see vancouver.weatherstats.ca). Let’s import the data, look at the first few rows and then plot the average temperature versus the day of the year.

df = pd.read_csv('temperature.csv')
df.head()
day month year dayofyear avg_temperature
0 13 4 2023 103 7.10
1 12 4 2023 102 5.19
2 11 4 2023 101 8.00
3 10 4 2023 100 7.69
4 9 4 2023 99 9.30
df.tail()
day month year dayofyear avg_temperature
9995 1 12 1995 335 6.40
9996 30 11 1995 334 9.15
9997 29 11 1995 333 11.50
9998 28 11 1995 332 9.75
9999 27 11 1995 331 6.90
plt.plot(df['dayofyear'],df['avg_temperature'],'.',alpha=0.1,lw=0)
plt.title('Daily Average Temperature 1995-2023')
plt.xlabel('Day of the Year'), plt.ylabel('Temperature (Celsius)'), plt.grid(True)
plt.show()
[Figure: scatter plot of daily average temperature vs. day of the year]

Let \(T\) be the temperature and let \(d\) be the day of the year. Define a regression model of the form

\[ T = \beta_0 + \beta_1 \cos(2 \pi d/365) + \beta_2 \sin(2 \pi d/365) + \varepsilon \]

Note that if we rewrite the variables as \(y = T\), \(x_1 = \cos(2 \pi d/365)\) and \(x_2 = \sin(2 \pi d/365)\) then we see that this is indeed a linear regression model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \]

Construct the corresponding data matrix \(X\) and compute the coefficient vector \(\boldsymbol{\beta}\).

N = len(df)
d = df['dayofyear']
T = df['avg_temperature']
X = np.column_stack([np.ones(N),np.cos(2*np.pi*d/365),np.sin(2*np.pi*d/365)])
beta = la.solve(X.T@X,X.T@T)

ds = np.linspace(0,365,500)
Ts = beta[0] + beta[1]*np.cos(2*np.pi*ds/365) + beta[2]*np.sin(2*np.pi*ds/365)

plt.plot(df['dayofyear'],df['avg_temperature'],'.',alpha=0.1,lw=0)
plt.plot(ds,Ts,'r'), plt.grid(True)

plt.title('Daily Average Temperature 1995-2023')
plt.xlabel('Day of the Year'), plt.ylabel('Temperature (Celsius)')
plt.show()
[Figure: temperature data with fitted sinusoidal model]
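Since \(\beta_1 \cos \theta + \beta_2 \sin \theta = R \cos(\theta - \phi)\) where \(R = \sqrt{\beta_1^2 + \beta_2^2}\) and \(\phi = \operatorname{atan2}(\beta_2,\beta_1)\), the fitted model is a sinusoid with mean \(\beta_0\), amplitude \(R\) and peak at day \(d = 365\phi/(2\pi)\). A short computation of these quantities (our addition, using the beta computed above):

# Amplitude and peak day of the fitted sinusoid
R = np.sqrt(beta[1]**2 + beta[2]**2)
phi = np.arctan2(beta[2],beta[1])
R, (365*phi/(2*np.pi)) % 365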