Statistical Computing

School of Economics and Management
Beihang University
http://yanfei.site

Applications of SVD

In data compression, we start with a matrix A that contains perfect data, and we try to find a (lower-rank) approximation to the data that seeks to capture the principal elements of the data.
- we sacrifice only the less important parts that don’t degrade data quality too much, in order to gain compression.
In noise filtering, we start with a matrix A that contains imperfect data, and we try to find a (lower-rank) approximation to the data that seeks to capture the principal elements of the data.
- we give up only the less important parts that are typically noise.
So both these tasks are related, and rely on the SVD to find a suitable lower-ranked approximation to a matrix

Matrix Approximation

Recall that the SVD of a matrix \(A\) in \(\mathcal{R}_{m\times n}\) decomposes the matrix into \[A = U\Sigma V^T = \left(\begin{array}{cc}U_1 & U_2\end{array}\right) \left(\begin{array}{cc} D & \\ & 0 \end{array}\right) \left(\begin{array}{cc}V_1^T & V_2^T\end{array}\right) = U_1^TDV_1^T\] where \(D = \text{diag}\{\alpha_1, \cdots, \alpha_r\} \in \mathcal{R}^{r\times r}\) with \(\sigma_r > 0\) for \(r = \text{rank}(A)\).
\(A\) can also be expressed as a sum of outer products, with each sum being a rank 1 matrix of dimension \(m\times n\) \[A = \Sigma_{i = 1}^r \sigma_i u_u v_i^T = \alpha_1 u_1 v_1^T + \cdots + \alpha_r u_r v_r^T.\]
We can truncate this sum when we feel that the singular values are so small that they are not contributing much.

SVD components

Matrix Approximation Error

So if we truncate the sum after \(k\) singular values, then we are approximating \(A\) with \(A_k\)

\[A_k = \Sigma_{i = 1}^{k<r} \sigma_i u_u v_i^T = \alpha_1 u_1 v_1^T + \cdots + \alpha_k u_k v_k^T.\]

The error in the approximation is \(\sigma_{k+1}\).
We have \(\text{rank}(A_k) = k < r = \text{rank}(A).\)
By approximating \(A\) with \(A_k\), we have saved memory:
- \(A\) requires us to store \(m\times n\) numbers
- \(A_k\) requires us to store ??? numbers?

– Significant savings for large matrices, and/or small \(k\).

Image Compression

Suppose we have a grayscale image (\(128 \times 128\) pixels).
- We can use a matrix to represent this image.
If we have a colour image, it has three matrices, with the same size as the image. Each matrix represents a color value that comprises the RGB color scale.
Each pixel is represented by an integer from 0-255.
Next, we can decompose the matrix by SVD.
By eliminating small singular values, we can approximate the matrix.
- Choose the value of \(k\) for the low-rank approximation \(A_k\).
- Plotting the singular values might help identify where there is a noticeable drop in significance

Reconstructing approximated image

Suppose we have chosen the value of \(k\) = number of singular values we wish to retain. - We can generate a new image matrix by expanding \(A\) using the SVD (first \(k\) singular values only).
If you want to use colour images, do it for R, G, B matrices separately and then reassemble.

Data compression in R

Here are four images with rank 286, 200, 116 and 32. How many numbers to store for each of them?

Plot of singular values

Lab session

Like in today’s lecture, take a high resolution image of yourself, and produce a sequence of low rank approximations.
How much can you save?

Noise reduction

The SVD also has applications in digital signal processing
The central idea is to let a matrix \(A\) represent the noisy signal, compute the SVD, and then discard small singular values of \(A\).
It can be shown that the small singular values mainly represent the noise, and thus the rank \(k\) matrix \(A_k\) represents a filtered signal with less noise.

Application to Signal Separation

Suppose we have \(P\) observed signals \(m_i(t)\) which are linear combinations of r source signals \(s_i(t)\), corrupted by noise signals \(n_i(t)\). We have a time series of data from \(T\) time periods
This can be written as: \[m_i(t) = \alpha_{i1}s_1(t) + \cdots + a_{ir}s_r(t) + n_i(t),\] where \(t = 1, \cdots, T\) and \(i = 1, \cdots, P\).
Write the equivalent matrix representation.

SVD of signal matrix \(M\)

We can decompose \(M\) using SVD: \(M = U\Sigma V^T\).
If the signal compared to the noise is sufficiently strong, we are likely to find clear distinction between the singular values due to the signal and those due to noise \[U = (U_s, U_n),~\Sigma = \left(\begin{array}{cc} \Sigma_s & \\ & \Sigma_n \end{array}\right),~V^T = \left(\begin{array}{c} V_s \\ V_n \end{array}\right)\]

Results of noise reduction

library(bootSVD)
set.seed(1)
Y <- simEEG(n = 100, centered = TRUE, propVarNoise = 0.3, wide = TRUE)
svdY <- svd(Y)
par(mfrow = c(2, 2))
for (j in c(3, 20, 50, 100)) {
    a = svdY$u[, 1:j] %*% diag(svdY$d[1:j]) %*% t(svdY$v[, 1:j])
    plot(a[1, ], type = "l", xlab = "", ylab = "signal", main = paste("r = ", 
        j, sep = ""))
}

Curve fitting

Curve fitting seeks to approximate the data by fitting a curve that minimises the sum of the squared residual errors.
Over-determined linear system.
No exact solutions.
We find an approximation.

Linear regression

For a linear regression, \[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\varepsilon},~\mathbf{\epsilon}\sim N(0,\Sigma).\] what is its OLS solution?

Numerical Issues

Numerically, there can be problems with solving for \(\mathbf{\beta}\) via the normal equations.
- The inverse of \(X^TX\) might not exist
- It is not computationally efficient to invert it for large dimensions
- \(X^TX\) maybe ill-conditioned
The SVD can help us by providing a numerically stable pseudo-inverse, and the opportunity to identify (through the singular values) any ill-conditioning.

Pseudo-inverse

Recall that the SVD of a matrix \(A\) in \(\mathcal{R}_{m\times n}\) decomposes the matrix into \[A = U\Sigma V^T = \left(\begin{array}{cc}U_1 & U_2\end{array}\right) \left(\begin{array}{cc} D & \\ & 0 \end{array}\right) \left(\begin{array}{cc}V_1^T & V_2^T\end{array}\right) = U_1DV_1^T\] where \(D = \text{diag}\{\alpha_1, \cdots, \alpha_r\} \in \mathcal{R}^{r\times r}\) with \(\sigma_r > 0\) for \(r = \text{rank}(A)\).

The pseudo-inverse, or generalised inverse, of \(A\) can always be calculated, even if \(A\) is not invertible. \[A^+ = V_1D^{-1}U_1^T\]
\(A^{+}\) is \(n \times m\).

Linear regression

For \[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\varepsilon},~\mathbf{\epsilon}\sim N(0,\Sigma),\] the solution is \[\beta = X^{+}y = V_1D^{-1}U_1^T y = \Sigma_{i = 1}^r \frac{v_iu_i^Ty}{\sigma_i}.\] A small singular value means trouble. A lower rank approximation should be more reliable.

An example of using generalised inverse

X <- matrix(c(1, 2, 4, 8, 3, 6, 9, 12, -11, -22, -32, -40), 4, 
    3)
y <- 2 * X[, 1] - 7 * X[, 2] + 1e-04 * (rnorm(4) - 0.5)
library(MASS)
beta <- ginv(X) %*% y  # write your own function to find generalised inverse
t(y - X %*% beta) %*% (y - X %*% beta)

##              [,1]
## [1,] 3.845924e-08

Lab session

Use generalised inverse to solve a linear regression problem, and compare with the lm() function in R.