layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>
# Bayesian Statistics and Computing
## Lecture 16: Least Squares
<img src="../figs/slides.png" width="150px"/>

#### *Yanfei Kang | BSC | Beihang University*

---
# Motivation

- Many real systems are **overdetermined**

$$ Ax \approx b,\quad A \in \mathbb{R}^{m \times n},\; m > n. $$

- No exact solution in general.
- Need the "best" approximate solution.

--

- **Least Squares** solves

$$ \min_x ||Ax - b||_2. $$

---
# Least Squares Objective

Define the residual:

$$ r = b - Ax. $$

Least squares solution:

$$ x^\* = \arg\min_x ||Ax - b||_2^2. $$

Equivalent to minimizing:

$$ \sum_{i=1}^m (b_i - (Ax)_i)^2. $$

---
class: inverse, center, middle

# Normal Equations

---
# Normal Equations

Set the gradient to zero:

$$ A^\top A x = A^\top b. $$

Solution:

$$ x = (A^\top A)^{-1} A^\top b. $$

### ⚠ Numerical Warning

The condition number is squared:

$$ \kappa(A^\top A) = \kappa(A)^2. $$

---
# Cholesky Factorization

- `\(A^\top A = R^\top R\)`, where `\(R\)` is upper triangular, so the resulting triangular systems are easy to solve.

### Cholesky Least Squares

1. Compute the Cholesky factorization `\(A^\top A = R^\top R\)`.
2. Solve the lower triangular system `\(R^\top w = A^\top b\)` for `\(w\)`.
3. Solve the upper triangular system `\(Rx = w\)` for `\(x\)`.

---
class: inverse, center, middle

# QR decomposition

---
# QR decomposition

The QR decomposition can be computed with the Gram-Schmidt Orthogonalisation (GSO) algorithm.

---
# GSO algorithm

Any set of basis vectors `\((a_1, a_2, \cdots, a_n)\)` can be transformed to an orthonormal basis `\((q_1,q_2,\cdots,q_n)\)` by:

<center>
<img src="./figs/gso.png" height="400px" />
</center>

---
# GSO `\(\rightarrow\)` QR

<center>
<img src="./figs/QR.png" height="200px" />
</center>

---
# QR

For any `\(m \times n\)` matrix `\(A\)`, we can express `\(A\)` as `\(A = QR\)`.

- `\(Q\)` is `\(m\times m\)`, orthogonal.
- `\(R\)` is `\(m\times n\)`, upper triangular.

For (non-square) tall matrices `\(A\)` with `\(m > n\)`, the last `\(m-n\)` rows of `\(R\)` are all zero, so we can express `\(A\)` as:

`$$A = QR = (Q_1, Q_2) \left(\begin{array}{c}R_1 \\ \mathbf{0} \end{array}\right) = Q_1R_1.$$`

---
# Solution to `\(Ax=b\)` by QR

If we have `\(A=QR\)` or (even better) the economy form `\(A=Q_1R_1\)`, then the linear system `\(Ax = b\)` can be easily solved:

`$$Ax = b$$`
`$$(Q_1R_1)x=b$$`
`$$R_1x = Q_1^Tb,$$`

since `\(Q_1^TQ_1 = I\)`, and `\(x\)` is found through back substitution.

---
# QR Least Squares

1. Compute the reduced QR factorization `\(A=Q_1R_1\)`.
2. Compute `\(Q_1^Tb\)`.
3. Solve the upper triangular system `\(R_1x = Q_1^Tb\)` for `\(x\)`.

**This approach is more stable than the Cholesky approach and is considered the standard method for least squares problems.**
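---
# The Three Approaches in R

A minimal sketch comparing the normal equations, Cholesky, and reduced QR recipes on a small, well-conditioned simulated problem (the data and variable names below are illustrative, not part of the lecture example):

```r
set.seed(1)
A <- matrix(rnorm(100 * 3), 100, 3)   # tall matrix: m = 100, n = 3
b <- rnorm(100)

# Normal equations: solve A'A x = A'b directly
x_ne <- solve(crossprod(A), crossprod(A, b))

# Cholesky: A'A = R'R, then solve R'w = A'b and Rx = w
R <- chol(crossprod(A))
w <- forwardsolve(t(R), crossprod(A, b))
x_chol <- backsolve(R, w)

# Reduced QR: A = Q1 R1, then solve R1 x = Q1'b
qrA <- qr(A)
x_qr <- backsolve(qr.R(qrA), crossprod(qr.Q(qrA), b))

# On a well-conditioned problem all three agree up to rounding error
cbind(normal = drop(x_ne), cholesky = drop(x_chol), qr = drop(x_qr))
```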
---
# Example

```r
set.seed(2023-09-08)
n <- 5000; p <- 100
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
```

---
# Collinearity

```r
W <- cbind(X, X[, 1] + rnorm(n, sd = 0.00000000000001))
solve(crossprod(W), crossprod(W, y))
#> Error in solve.default(crossprod(W), crossprod(W, y)): system is computationally singular: reciprocal condition number = 1.89233e-16
```

---
# QR decomposition

```r
qr.coef(qr(W), y)
#> [1] 0.0207049604 -0.0107798798 -0.0053346446 -0.0139042768 -0.0165625816 -0.0247368954 -0.0055936520 -0.0050670925
#> [9] -0.0141895452 0.0014650371 0.0105871852 0.0236855005 -0.0009123881 0.0395699043 -0.0045459668 0.0156426008
#> [17] 0.0368986437 -0.0062241571 0.0003719454 -0.0001045447 0.0105437196 0.0032069072 -0.0099112962 0.0092091722
#> [25] -0.0056283760 -0.0101836666 0.0004409933 0.0079016686 0.0101626882 -0.0208409039 -0.0054924536 0.0064271925
#> [33] -0.0065590754 -0.0037863800 0.0072126283 -0.0013686652 -0.0261588414 -0.0045260288 -0.0204600332 -0.0108791075
#> [41] 0.0100509988 0.0021846634 0.0096076628 -0.0256744385 -0.0002072185 -0.0203092434 0.0199490920 0.0245456994
#> [49] 0.0141368132 -0.0075821183 -0.0331999816 0.0070728928 -0.0251681102 -0.0145501245 0.0124959304 0.0101724169
#> [57] -0.0041669575 0.0118368027 0.0058702645 0.0017322817 0.0008573064 -0.0068535688 -0.0101274630 -0.0093643459
#> [65] 0.0045194364 -0.0224489607 0.0175468581 0.0135508696 0.0048772551 0.0168445216 -0.0254587733 -0.0012700923
#> [73] -0.0106302388 -0.0017762102 0.0077490915 0.0110904920 -0.0144777066 -0.0079745076 0.0112005231 -0.0012697611
#> [81] -0.0024046193 -0.0094265032 0.0234028545 -0.0087827166 0.0097232715 -0.0188332734 -0.0151177605 0.0129435368
#> [89] -0.0086064316 0.0048073761 -0.0209634872 -0.0133873152 0.0203494525 0.0031885641 -0.0088269877 0.0064179865
#> [97] -0.0016902431 -0.0149829401 -0.0018118949 -0.0161594580 NA
```

QR still produces estimates and flags the redundant column with `NA`.

---
# Advantages

- No need to form `\(A^\top A\)`.
- Orthogonal transformations preserve numerical stability.

---
class: inverse, center, middle

# SVD Solution (Most Stable)

---
# SVD Solution

If:

$$ A = U_1D V_1^\top, $$

then the LS solution is:

$$ x = V_1 D^{-1} U_1^\top b. $$

---
# SVD Least Squares

1. Compute the reduced SVD `\(A = U_1DV_1^T\)`.
2. Compute `\(U_1^\top b\)`.
3. Solve the diagonal system `\(Dw = U_1^\top b\)` for `\(w\)`.
4. Compute `\(x = V_1w\)`.

---
# Benefits

- Works even when `\(A\)` is rank-deficient.
- Optimal numerical stability.
- Connects to PCA & low-rank approximation.
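---
# SVD Least Squares in R

A minimal sketch of the steps above using base R's `svd()` (the simulated data and variable names are illustrative):

```r
set.seed(1)
A <- matrix(rnorm(50 * 3), 50, 3)    # tall, full-rank matrix
b <- rnorm(50)

sv <- svd(A)                          # reduced SVD: A = U1 D V1'
w  <- crossprod(sv$u, b) / sv$d       # solve the diagonal system D w = U1'b
x_svd <- sv$v %*% w                   # x = V1 w

# Agrees with the QR-based solver on a full-rank problem
cbind(x_svd = drop(x_svd), x_qr = qr.solve(A, b))
```

For a rank-deficient `\(A\)`, the same recipe still applies if we keep only the singular values above a tolerance, which gives the pseudoinverse solution.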
---
class: inverse, center, middle

# Updating a Least Squares Solution of an Overdetermined System

---
# Background

We want to solve the LS problem

$$ \min_x ||Ax - b||_2, $$

where:

- We *already have* its least squares solution.
- New data arrives: a new row `\(a^\top\)` and a new response `\(b_{\text{new}}\)`.

**Goal:** Update the LS solution *efficiently*, without recomputing everything.

---
# Why Update?

Recomputing QR for the new matrix

$$ A_{\text{new}}= \begin{bmatrix} A \\ a^\top \end{bmatrix} $$

costs:

- `\(O(mn^2)\)` (expensive!)

Updating QR costs only:

- `\(O(n^2)\)` (much cheaper!)

---
# Background: LS via QR

Given:

$$ A = QR, $$

the LS solution satisfies:

$$ Rx = Q^\top b. $$

If new data is appended, we want to update `\(R\)` and `\(Q\)`.

---
# Appending a New Row

Consider:

$$ A_{\text{new}}= \begin{bmatrix} A \\\\ a^\top \end{bmatrix} = \begin{bmatrix} Q & 0 \\\\ 0 & 1 \end{bmatrix} \begin{bmatrix} R \\\\ a^\top \end{bmatrix}. $$

Only the block

$$ \begin{bmatrix} R \\\\ a^\top \end{bmatrix} $$

needs to be re-triangularized using Givens rotations.

---
# Givens Rotations

To zero out an entry `\(b\)` in a vector `\((a,b)^\top\)`:

$$ c = \frac{a}{\sqrt{a^2 + b^2}},\qquad s = \frac{b}{\sqrt{a^2 + b^2}}, $$

$$ \begin{bmatrix} c & s \\\\ -s & c \end{bmatrix} \begin{bmatrix} a \\\\ b \end{bmatrix} = \begin{bmatrix} \sqrt{a^2+b^2} \\\\ 0 \end{bmatrix}. $$

This local 2D rotation is embedded into an identity matrix of the appropriate size.

---
# Updating QR with Givens

Starting from:

$$ \tilde{R} = \begin{bmatrix} R \\\\ a^\top \end{bmatrix}, $$

apply Givens rotations to zero out the entries of the appended row `\(a^\top\)`, one column at a time:

`$$G_k \cdots G_1 \tilde{R} \rightarrow R_{\text{new}}.$$`

Updated `\(Q\)`:

$$ Q_{\text{new}} = \begin{bmatrix}Q & 0\\\\0&1\end{bmatrix} (G_1^\top \cdots G_k^\top). $$

---
class: inverse, center, middle

# Weighted Least Squares

---
# Linear Regression Model

We begin with:

$$ y = X\beta + \epsilon. $$

Standard OLS assumes:

$$ \epsilon \sim N(0, \sigma^2 I). $$

But often the variance is **not constant**:

$$ \text{Var}(\epsilon_i) = \sigma_i^2. $$

This is *heteroskedasticity*.

---
# Why Use Weighted Least Squares?

If each observation has a different noise level:

- Precise observations: small `\(\sigma_i^2\)`
- Noisy observations: large `\(\sigma_i^2\)`

WLS assigns weights:

$$ w_i = \frac{1}{\sigma_i^2}. $$

WLS problem:

`$$\hat\beta = \arg\min_\beta \sum_{i=1}^n w_i (y_i - x_i^T \beta)^2.$$`

---
# Matrix Form of WLS

Let

$$ W = \mathrm{diag}(w_1,\dots,w_n). $$

Then WLS solves:

$$ \hat\beta = \arg\min_\beta ||W^{1/2}(y - X\beta)||_2^2. $$

Equivalent to OLS on a transformed system.

---
# Derivation of WLS Normal Equations

Objective:

$$ f(\beta) = (y - X\beta)^\top W (y - X\beta). $$

Gradient:

$$ \nabla f = -2 X^\top W (y - X\beta). $$

Set `\(\nabla f = 0\)`:

$$ X^\top W X \,\beta = X^\top W y. $$

Solution:

$$ \hat\beta = (X^\top W X)^{-1} X^\top W y. $$

---
# Transforming to an Equal-Variance System

Define:

$$ y^\* = W^{1/2} y, \qquad X^\* = W^{1/2} X. $$

Multiply the regression equation by `\(W^{1/2}\)`:

$$ W^{1/2}y = W^{1/2} X\beta + W^{1/2}\epsilon. $$

Since:

$$ \text{Var}(W^{1/2}\epsilon) = I, $$

the transformed model is homoskedastic:

$$ y^\* = X^\*\beta + \epsilon^\*, \quad \epsilon^\* \sim N(0, I). $$

WLS is simply OLS applied to `\(\left(X^*, y^*\right)\)`.

---
# WLS via QR

Compute the QR decomposition:

$$ X^\* = QR, $$

where `\(Q^\top Q = I\)` and `\(R\)` is upper triangular.

The WLS solution becomes:

$$ R\beta = Q^\top y^\*. $$

Solve by back-substitution.

---
# Lab session

1. Generate a matrix `\(X\)` and a vector `\(y\)`. Compute the least squares estimator `\(\hat\beta\)` using the Normal Equations, QR decomposition, and SVD. Compare the three estimates.
2. Construct a nearly collinear matrix by duplicating a column of `\(X\)`. Compare how the Normal Equations, QR, and SVD behave.
3. Simulate heteroskedastic errors with variances `\(\sigma_i^2\)`. Solve WLS with QR using weights `\(w_i = 1/\sigma_i^2\)`. Compare the WLS and OLS estimates.
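---
# Lab hint: WLS via QR

A possible starting point for exercise 3 (a sketch only; the simulated design, the variance function, and the variable names are illustrative assumptions, not the required setup):

```r
set.seed(123)
n <- 200
x <- rnorm(n)
X <- cbind(1, x)                          # design matrix with intercept
sigma2 <- exp(x)                          # heteroskedastic variances (illustrative choice)
y <- drop(X %*% c(1, 2)) + rnorm(n, sd = sqrt(sigma2))

w  <- 1 / sigma2                          # weights w_i = 1 / sigma_i^2
Xs <- sqrt(w) * X                         # X* = W^{1/2} X (row-wise scaling)
ys <- sqrt(w) * y                         # y* = W^{1/2} y

# WLS = OLS on the transformed system, solved via QR
qr_s <- qr(Xs)
beta_wls <- backsolve(qr.R(qr_s), crossprod(qr.Q(qr_s), ys))

# Ordinary least squares on the untransformed data, for comparison
beta_ols <- qr.coef(qr(X), y)

cbind(wls = drop(beta_wls), ols = beta_ols)
```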