layout: true

---
class: inverse, center, middle
background-image: url(../figs/titlepage16-9.png)
background-size: cover

<br>
<br>
# Bayesian Statistics and Computing
## Lecture 16: Least Squares
<img src="../figs/slides.png" width="150px"/>

#### *Yanfei Kang | BSC | Beihang University*

---
# Motivation

- Many real systems are **overdetermined**

$$ Ax \approx b,\quad A \in \mathbb{R}^{m \times n},\; m > n. $$

- No exact solution in general.
- Need the "best" approximate solution.

--

- **Least Squares** solves

$$ \min_x ||Ax - b||_2. $$

---
# Least Squares Objective

Define the residual:

$$ r = b - Ax. $$

Least squares solution:

$$ x^\* = \arg\min_x ||Ax - b||_2^2. $$

Equivalent to minimizing:

$$ \sum_{i=1}^m (b_i - (Ax)_i)^2. $$

---
class: inverse, center, middle

# Normal Equations

---
# Normal Equations

Set the gradient to zero:

$$ A^\top A x = A^\top b. $$

Solution:

$$ x = (A^\top A)^{-1} A^\top b. $$

### ⚠ Numerical Warning

The condition number is squared:

$$ \kappa(A^\top A) = \kappa(A)^2. $$

---
# Cholesky Factorization

- `\(A^\top A = R^\top R\)`, where `\(R\)` is upper triangular, so the resulting triangular systems are easy to solve.

### Cholesky Least Squares

1. Compute the Cholesky factorization `\(A^\top A = R^\top R\)`.
2. Solve the lower triangular system `\(R^\top w = A^\top b\)` for `\(w\)`.
3. Solve the upper triangular system `\(Rx = w\)` for `\(x\)`.

---
class: inverse, center, middle

# QR decomposition

---
# QR decomposition

The QR decomposition can be computed with the Gram-Schmidt Orthogonalisation (GSO) algorithm.

---
# GSO algorithm

Any set of basis vectors `\((a_1, a_2, \cdots, a_n)\)` can be transformed to an orthonormal basis `\((q_1,q_2,\cdots,q_n)\)` by:

<center>
<img src="./figs/gso.png" height="400px" />
</center>

---
# GSO `\(\rightarrow\)` QR

<center>
<img src="./figs/QR.png" height="200px" />
</center>

---
# QR

For any `\(m \times n\)` matrix `\(A\)`, we can express `\(A\)` as `\(A = QR\)`.

- `\(Q\)` is `\(m\times m\)`, orthogonal.
- `\(R\)` is `\(m\times n\)`, upper triangular.

For (non-square) tall matrices `\(A\)` with `\(m > n\)`, the last `\(m-n\)` rows of `\(R\)` are all zero, so we can express `\(A\)` as:

`$$A = QR = (Q_1, Q_2) \left(\begin{array}{c}R_1 \\ \mathbf{0} \end{array}\right) = Q_1R_1.$$`

---
# Solution to `\(Ax=b\)` by QR

If we have `\(A=QR\)` or (even better) the economy form `\(A=Q_1R_1\)`, then the linear system `\(Ax = b\)` can be easily solved:

`$$Ax = b$$`
`$$(Q_1R_1)x=b$$`
`$$R_1x = Q_1^Tb,$$`

since `\(Q_1^TQ_1 = I\)`, and `\(x\)` is found through back substitution.

---
# QR Least Squares

1. Compute the reduced QR factorization `\(A=Q_1R_1\)`.
2. Compute `\(Q_1^Tb\)`.
3. Solve the upper triangular system `\(R_1x = Q_1^Tb\)` for `\(x\)`.

**This approach is more stable than the Cholesky approach and is considered the standard method for least squares problems.**
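---
# The Three Approaches in R

A minimal sketch comparing the normal equations, Cholesky, and reduced QR recipes on a small, well-conditioned simulated problem (the data and variable names below are illustrative, not part of the lecture example):

```r
set.seed(1)
A <- matrix(rnorm(100 * 3), 100, 3)   # tall matrix: m = 100, n = 3
b <- rnorm(100)

# Normal equations: solve A'A x = A'b directly
x_ne <- solve(crossprod(A), crossprod(A, b))

# Cholesky: A'A = R'R, then solve R'w = A'b and Rx = w
R <- chol(crossprod(A))
w <- forwardsolve(t(R), crossprod(A, b))
x_chol <- backsolve(R, w)

# Reduced QR: A = Q1 R1, then solve R1 x = Q1'b
qrA <- qr(A)
x_qr <- backsolve(qr.R(qrA), crossprod(qr.Q(qrA), b))

# On a well-conditioned problem all three agree up to rounding error
cbind(normal = drop(x_ne), cholesky = drop(x_chol), qr = drop(x_qr))
```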
---
# Example

```r
set.seed(2023-09-08)
n <- 5000; p <- 100
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
```

---
# Collinearity

```r
W <- cbind(X, X[, 1] + rnorm(n, sd = 0.00000000000001))
solve(crossprod(W), crossprod(W, y))
#> Error in solve.default(crossprod(W), crossprod(W, y)): system is computationally singular: reciprocal condition number = 1.89233e-16
```

---
# QR decomposition

```r
qr.coef(qr(W), y)
#> [1] 0.0207049604 -0.0107798798 -0.0053346446 -0.0139042768 -0.0165625816 -0.0247368954 -0.0055936520 -0.0050670925
#> [9] -0.0141895452 0.0014650371 0.0105871852 0.0236855005 -0.0009123881 0.0395699043 -0.0045459668 0.0156426008
#> [17] 0.0368986437 -0.0062241571 0.0003719454 -0.0001045447 0.0105437196 0.0032069072 -0.0099112962 0.0092091722
#> [25] -0.0056283760 -0.0101836666 0.0004409933 0.0079016686 0.0101626882 -0.0208409039 -0.0054924536 0.0064271925
#> [33] -0.0065590754 -0.0037863800 0.0072126283 -0.0013686652 -0.0261588414 -0.0045260288 -0.0204600332 -0.0108791075
#> [41] 0.0100509988 0.0021846634 0.0096076628 -0.0256744385 -0.0002072185 -0.0203092434 0.0199490920 0.0245456994
#> [49] 0.0141368132 -0.0075821183 -0.0331999816 0.0070728928 -0.0251681102 -0.0145501245 0.0124959304 0.0101724169
#> [57] -0.0041669575 0.0118368027 0.0058702645 0.0017322817 0.0008573064 -0.0068535688 -0.0101274630 -0.0093643459
#> [65] 0.0045194364 -0.0224489607 0.0175468581 0.0135508696 0.0048772551 0.0168445216 -0.0254587733 -0.0012700923
#> [73] -0.0106302388 -0.0017762102 0.0077490915 0.0110904920 -0.0144777066 -0.0079745076 0.0112005231 -0.0012697611
#> [81] -0.0024046193 -0.0094265032 0.0234028545 -0.0087827166 0.0097232715 -0.0188332734 -0.0151177605 0.0129435368
#> [89] -0.0086064316 0.0048073761 -0.0209634872 -0.0133873152 0.0203494525 0.0031885641 -0.0088269877 0.0064179865
#> [97] -0.0016902431 -0.0149829401 -0.0018118949 -0.0161594580 NA
```

QR still produces estimates and flags the redundant column with `NA`.

---
# Advantages

- No need to form `\(A^\top A\)`.
- Orthogonal transformations preserve numerical stability.

---
class: inverse, center, middle

# SVD Solution (Most Stable)

---
# SVD Solution

If:

$$ A = U_1D V_1^\top, $$

then the LS solution is:

$$ x = V_1 D^{-1} U_1^\top b. $$

---
# SVD Least Squares

1. Compute the reduced SVD `\(A = U_1DV_1^T\)`.
2. Compute `\(U_1^\top b\)`.
3. Solve the diagonal system `\(Dw = U_1^\top b\)` for `\(w\)`.
4. Compute `\(x = V_1w\)`.

---
# Benefits

- Works even when `\(A\)` is rank-deficient.
- Optimal numerical stability.
- Connects to PCA & low-rank approximation.
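---
# SVD Least Squares in R

A minimal sketch of the steps above using base R's `svd()` (the simulated data and variable names are illustrative):

```r
set.seed(1)
A <- matrix(rnorm(50 * 3), 50, 3)    # tall, full-rank matrix
b <- rnorm(50)

sv <- svd(A)                          # reduced SVD: A = U1 D V1'
w  <- crossprod(sv$u, b) / sv$d       # solve the diagonal system D w = U1'b
x_svd <- sv$v %*% w                   # x = V1 w

# Agrees with the QR-based solver on a full-rank problem
cbind(x_svd = drop(x_svd), x_qr = qr.solve(A, b))
```

For a rank-deficient `\(A\)`, the same recipe still applies if we keep only the singular values above a tolerance, which gives the pseudoinverse solution.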
---
class: inverse, center, middle

# Updating a Least Squares Solution of an Overdetermined System

---
# Background

We want to solve the LS problem

$$ \min_x ||Ax - b||_2, $$

where:

- We *already have* its least squares solution.
- New data arrives: a new row `\(a^\top\)` and a new response `\(b_{\text{new}}\)`.

**Goal:** Update the LS solution *efficiently*, without recomputing everything.

---
# Why Update?

Recomputing QR for the new matrix

$$ A_{\text{new}}= \begin{bmatrix} A \\ a^\top \end{bmatrix} $$

costs:

- `\(O(mn^2)\)` (expensive!)

Updating QR costs only:

- `\(O(n^2)\)` (much cheaper!)

---
# Background: LS via QR

Given:

$$ A = QR, $$

the LS solution satisfies:

$$ Rx = Q^\top b. $$

If new data is appended, we want to update `\(R\)` and `\(Q\)`.

---
# Appending a New Row

Consider:

$$ A_{\text{new}}= \begin{bmatrix} A \\\\ a^\top \end{bmatrix} = \begin{bmatrix} Q & 0 \\\\ 0 & 1 \end{bmatrix} \begin{bmatrix} R \\\\ a^\top \end{bmatrix}. $$

Only the block

$$ \begin{bmatrix} R \\\\ a^\top \end{bmatrix} $$

needs to be re-triangularized using Givens rotations.

---
# Givens Rotations

To zero out an entry `\(b\)` in a vector `\((a,b)^\top\)`:

$$ c = \frac{a}{\sqrt{a^2 + b^2}},\qquad s = \frac{b}{\sqrt{a^2 + b^2}}, $$

$$ \begin{bmatrix} c & s \\\\ -s & c \end{bmatrix} \begin{bmatrix} a \\\\ b \end{bmatrix} = \begin{bmatrix} \sqrt{a^2+b^2} \\\\ 0 \end{bmatrix}. $$

This local 2D rotation is embedded into an identity matrix of the appropriate size.

---
# Updating QR with Givens

Starting from:

$$ \tilde{R} = \begin{bmatrix} R \\\\ a^\top \end{bmatrix}, $$

apply Givens rotations to zero out the entries of the appended row `\(a^\top\)`, one column at a time:

`$$G_k \cdots G_1 \tilde{R} \rightarrow R_{\text{new}}.$$`

Updated `\(Q\)`:

$$ Q_{\text{new}} = \begin{bmatrix}Q & 0\\\\0&1\end{bmatrix} (G_1^\top \cdots G_k^\top). $$

---
class: inverse, center, middle

# Weighted Least Squares

---
# Linear Regression Model

We begin with:

$$ y = X\beta + \epsilon. $$

Standard OLS assumes:

$$ \epsilon \sim N(0, \sigma^2 I). $$

But often the variance is **not constant**:

$$ \text{Var}(\epsilon_i) = \sigma_i^2. $$

This is *heteroskedasticity*.

---
# Why Use Weighted Least Squares?

If each observation has a different noise level:

- Precise observations: small `\(\sigma_i^2\)`
- Noisy observations: large `\(\sigma_i^2\)`

WLS assigns weights:

$$ w_i = \frac{1}{\sigma_i^2}. $$

WLS problem:

`$$\hat\beta = \arg\min_\beta \sum_{i=1}^n w_i (y_i - x_i^T \beta)^2.$$`

---
# Matrix Form of WLS

Let

$$ W = \mathrm{diag}(w_1,\dots,w_n). $$

Then WLS solves:

$$ \hat\beta = \arg\min_\beta ||W^{1/2}(y - X\beta)||_2^2. $$

Equivalent to OLS on a transformed system.

---
# Derivation of WLS Normal Equations

Objective:

$$ f(\beta) = (y - X\beta)^\top W (y - X\beta). $$

Gradient:

$$ \nabla f = -2 X^\top W (y - X\beta). $$

Set `\(\nabla f = 0\)`:

$$ X^\top W X \,\beta = X^\top W y. $$

Solution:

$$ \hat\beta = (X^\top W X)^{-1} X^\top W y. $$

---
# Transforming to an Equal-Variance System

Define:

$$ y^\* = W^{1/2} y, \qquad X^\* = W^{1/2} X. $$

Multiply the regression equation by `\(W^{1/2}\)`:

$$ W^{1/2}y = W^{1/2} X\beta + W^{1/2}\epsilon. $$

Since:

$$ \text{Var}(W^{1/2}\epsilon) = I, $$

the transformed model is homoskedastic:

$$ y^\* = X^\*\beta + \epsilon^\*, \quad \epsilon^\* \sim N(0, I). $$

WLS is simply OLS applied to `\(\left(X^*, y^*\right)\)`.

---
# WLS via QR

Compute the QR decomposition:

$$ X^\* = QR, $$

where `\(Q^\top Q = I\)` and `\(R\)` is upper triangular.

The WLS solution becomes:

$$ R\beta = Q^\top y^\*. $$

Solve by back-substitution.

---
# Lab session

1. Generate a matrix `\(X\)` and a vector `\(y\)`. Compute the least squares estimator `\(\hat\beta\)` using the Normal Equations, QR decomposition, and SVD. Compare the three estimates.
2. Construct a nearly collinear matrix by duplicating a column of `\(X\)`. Compare how the Normal Equations, QR, and SVD behave.
3. Simulate heteroskedastic errors with variances `\(\sigma_i^2\)`. Solve WLS with QR using weights `\(w_i = 1/\sigma_i^2\)`. Compare the WLS and OLS estimates.
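---
# Lab hint: WLS via QR

A possible starting point for exercise 3 (a sketch only; the simulated design, the variance function, and the variable names are illustrative assumptions, not the required setup):

```r
set.seed(123)
n <- 200
x <- rnorm(n)
X <- cbind(1, x)                          # design matrix with intercept
sigma2 <- exp(x)                          # heteroskedastic variances (illustrative choice)
y <- drop(X %*% c(1, 2)) + rnorm(n, sd = sqrt(sigma2))

w  <- 1 / sigma2                          # weights w_i = 1 / sigma_i^2
Xs <- sqrt(w) * X                         # X* = W^{1/2} X (row-wise scaling)
ys <- sqrt(w) * y                         # y* = W^{1/2} y

# WLS = OLS on the transformed system, solved via QR
qr_s <- qr(Xs)
beta_wls <- backsolve(qr.R(qr_s), crossprod(qr.Q(qr_s), ys))

# Ordinary least squares on the untransformed data, for comparison
beta_ols <- qr.coef(qr(X), y)

cbind(wls = drop(beta_wls), ols = beta_ols)
```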