Bayesian Statistics and Computing

---

# Bayesian Statistics and Computing

## Lecture 17: Applications of SVD

#### *Yanfei Kang | BSC | Beihang University*

---
class: inverse, center, middle
# Text mining

---
# Raw human written text `$\Rightarrow$` Structured information

- The biggest difference between text mining and general data analysis is that it deals with text data, instead of numeric values.

- Sometimes text mining is called 'Natural Language Processing (NLP)', especially in computer science.

- Most text mining methods are based on word frequency in real world.

---
# What do you usually see in text mining?

- Corpus
    - a collection of documents (e.g., a collection of different job description documents)
    
- Word segment
    - segment each text into words
    - stopwords: common words that generally do not contribute to the meaning of a sentence, at least for the purposes of information retrieval and natural language processing. These are words such as the and a. Most search engines will filter out stopwords from search queries and documents in order to save space in their index.

---
# What do you usually see in text mining?

- DocumentTermMatrix
    - Each row is a document, while each column shows word frequencies of the corresponding word.
    - This is the very basic data structure for text mining.
    
- TermDocumentMatrix

- Text clustering
    - Group similar documents together according to their similarities.
    
- Topic models
    - Find topics which the corpus is talking about.
    
    
---
#  Latent Semantic Analysis (LSA)

- Extract relationships between the documents and terms assuming that terms that are close in meaning will appear in similar (i.e., correlated) pieces of text.

- LSA leverages a SVD factorization of a document-term matrix to extract these relationships. `$$A_{d\times t} = D_{d\times d}\Sigma_{d\times t} T_{t \times t}^T.$$`

- Each SV relates to a *semantic dimension* (*topic*).

- `$\Sigma$` gives importance of topic.

- `$D$` contains the eigenvectors of the document correlations, `$AA^T$`. It relates docs to topics.

- `$T$` contains the eigenvectors of the term correlations, `$A^TA$`. It relates terms to topics; value gives strength.

---
# LSA to the Rescue!

- LSA often remediates the curse of dimensionality problem in text analytics.

- The matrix factorization has the effect of combining columns, potentially enriching signal in the data.

- By selecting a fraction of the most important singular values, LSA can dramatically reduce dimensionality.
`$$A_{d\times t} \approx D_{d\times k}\Sigma_{k\times k} T_{t \times k}^T.$$`
- `$D_{d\times k}^TA_{d \times t} = \Sigma_{k\times k} T_{t \times k}^T$` are terms in semantic space.

- `$\Sigma_{k\times k} D_{d \times k}^T = T_{t \times k}^T A_{d \times t}^T$`  are docs in semantic space.   
---
# Lab session

```r
library(quanteda)
txt <- c(d1 = "The Google TM matrix P is a model of the Internet.",
         d2 = "Pij is nonzero if there is a link from Web page j to i.",
         d3 = "The Google matrix is used to rank all Web pages." ,
         d4= "The ranking is done by solving a matrix eigenvalue problem.",
         d5 = "England dropped out of the top 10 in the FIFA ranking.")
keywords <- c('eigenvalue', 'England', 'FIFA', 'Google', 'Internet', 'link', 'matrix', 
              'page', 'rank', 'Web', 'ranking', 'pages')
mydfm <- dfm(txt, tolower = FALSE)%>%
    dfm_match(features = keywords)
mydfm <- dfm_wordstem(mydfm)
mydfm
#> Document-feature matrix of: 5 documents, 10 features (66.00% sparse) and 0 docvars.
#>     features
#> docs eigenvalu England FIFA Googl Internet link matrix page rank Web
#>   d1         0       0    0     1        1    0      1    0    0   0
#>   d2         0       0    0     0        0    1      0    1    0   1
#>   d3         0       0    0     1        0    0      1    1    1   1
#>   d4         1       0    0     0        0    0      1    0    1   0
#>   d5         0       1    1     0        0    0      0    0    1   0
```

---
# Lab session

```r
# Perform svd
mylsa <- svd(as.matrix(mydfm))
mylsa
#> $d
#> [1] 2.8546460 1.8822858 1.7320508 1.2603301 0.8482714
#> 
#> $u
#>            [,1]       [,2]          [,3]        [,4]       [,5]
#> [1,] -0.3702377  0.1392818 -6.666667e-01  0.47220519 -0.4196456
#> [2,] -0.2912386 -0.7029673  3.333333e-01 -0.04360456 -0.5549816
#> [3,] -0.7497942 -0.1908556  1.443290e-15  0.03077540  0.6327999
#> [4,] -0.4067669  0.4573411 -2.081668e-15 -0.72810055 -0.3086249
#> [5,] -0.2246184  0.4907655  6.666667e-01  0.49400747 -0.1421548
#> 
#> $v
#>              [,1]        [,2]          [,3]        [,4]        [,5]
#>  [1,] -0.14249296  0.24297114 -1.401657e-15 -0.57770622 -0.36382805
#>  [2,] -0.07868521  0.26072846  3.849002e-01  0.39196673 -0.16758177
#>  [3,] -0.07868521  0.26072846  3.849002e-01  0.39196673 -0.16758177
#>  [4,] -0.39235405 -0.02739954 -3.849002e-01  0.39908639  0.25128072
#>  [5,] -0.12969654  0.07399609 -3.849002e-01  0.37466786 -0.49470677
#>  [6,] -0.10202265 -0.37346473  1.924501e-01 -0.03459773 -0.65424999
#>  [7,] -0.53484700  0.21557159 -3.849002e-01 -0.17861984 -0.11254733
#>  [8,] -0.36468016 -0.47486036  1.924501e-01 -0.01017920  0.09173749
#>  [9,] -0.48383568  0.40230396  3.849002e-01 -0.16132097  0.21457767
#> [10,] -0.36468016 -0.47486036  1.924501e-01 -0.01017920  0.09173749
```
]

---
# LSA

- Rank-lowering SVD on the TDM is used in the cluster of
related techniques known as Latent Semantic Analysis.

- The *big claim* for LSA that this captures the *semantic structure* of the collection.

- Matches by *topic*, not term.

- Automatically expands term into underlying topic.

- Allows semantically related documents (queries) to match, even if different terms used.

- Also referred to as *Latent Semantic Indexing*, or LSI.

---
# Document comparison

- Documents `$d_i$` and `$d_j$` can be compared using cosine distance on `$i$` and `$j$` columns of `$Z$`.

- Dense, rather than sparse.

- Compares by *concepts*.

- Clustering can also be done in semantic space.
]

---
# Term similarity

<center> 
![](figure/Cluster-WordCloud.jpg)

---
# Lab session

```r
# Documents in the semantic space
mylsa_docs <- t(mylsa$v)%*%t(as.matrix(mydfm))
# New query 
querydfm <- dfm(c("ranking of Web pages"), tolower = FALSE) %>%
    dfm_match(features = keywords)
querydfm <- dfm_wordstem(querydfm)
querydfm
#> Document-feature matrix of: 1 document, 10 features (70.00% sparse) and 0 docvars.
#>        features
#> docs    eigenvalu England FIFA Googl Internet link matrix page rank Web
#>   text1         0       0    0     0        0    0      0    1    1   1
# Calculate cosine similarity
apply(as.matrix(mydfm), 1 , lsa::cosine, y = as.vector(querydfm))
#>        d1        d2        d3        d4        d5 
#> 0.0000000 0.6666667 0.7745967 0.3333333 0.3333333
```

---
# Lab session

```r
# Query in the semantic space
newq <- t(mylsa$v) %*% as.vector(querydfm)
# Take the first two SVs
newquery <- newq[1:2]
# Calculate cosine similarity after projection
apply(mylsa_docs[1:2, ], 2, lsa::cosine, y = newquery)
#>        d1        d2        d3        d4        d5 
#> 0.7856726 0.8331890 0.9670101 0.4872887 0.1818875
```

---
class: inverse, center, middle
# Image Recognition

---

# Classification of handwritten digits

---
# Read images in R

```r
## The image matrix for training sample. 256x1707
azip <- read.table("azip.dat")
## The true digits given in the training sample. length = 1707
dzip <- as.numeric(read.table("dzip.dat"))
## The testing image matrix. 256x2007
testzip <- read.table("testzip.dat")
## The true digits for the testing sample. length = 2007
dtest <- read.table("dtest.dat")
## Display the image
i <- 120
image(matrix(azip[, i], ncol = 16)[, 16:1], col = gray(255:0/255))
```
]

.pull-right[
<img src="figure/plot-readimage-1.svg" width="504" style="display: block; margin: auto;" />
]

---
#  A naive method

The naive method is to check the distance from each test image to the mean of training image.

**Training: **Given the manually classified training set, compute the means (centroids) `$m_i$`, `$i = 0,...,9$`, of all the 10 classes.

**Classification: **For each digit in the test set, classify it as `$k$` if `$m_k$` is the closest
mean.

---
#  A naive method

```r
## The mean of training sample of a single digit
digits <- 0:9  # The possible digits in the US postal code
img.mean <- matrix(0, 256, length(digits))
for (i in digits) {
    idx <- (i == dzip)  # the location indicator for the ith digit
    imgi <- azip[, idx, drop = FALSE]
    imgi.mean <- rowMeans(imgi)
    img.mean[, i + 1] <- imgi.mean
}
## Plot the mean images
par(mfrow = c(2, 5))
for (i in 1:10) {
    image(matrix(img.mean[, i], ncol = 16)[, 16:1], col = gray(255:0/255))
}
```

---
#  The mean images
<img src="figure/plot-naive-1.svg" width="1008" style="display: block; margin: auto;" />

---
# The naive method

.scroll-output[
- Now it is the time to check the testing sample to the mean of the training sample. We pick the first five testing digits.

- We find the first, third and the fifth are rather easy to classify by eyeballs. But the second and fourth ones are particular difficult.

```r
## Sketch a distance function to compute the Euclidean distance between two matrices in row wise.
rdist <- function(X, Y) {
    dim.X <- dim(X)
    dim.Y <- dim(Y)
    sum.X <- matrix(rowSums(X^2), dim.X[1], dim.Y[1])
    sum.Y <- matrix(rowSums(Y^2), dim.X[1], dim.Y[1], byrow = TRUE)
    dist0 <- sum.X + sum.Y - 2 * tcrossprod(X, Y)
    out <- sqrt(dist0)
    return(out)
}

## For an unknown testing digit image, compare the distance to the means
test.sample <- 1:5

## Let's first plot those testing image
par(mfcol = c(ceiling(length(test.sample)/5), 5))  # five columns

for (i in test.sample) {
    image(matrix(testzip[, i], ncol = 16)[, 16:1], col = gray(255:0/255))
}

## Calculate the distance from testing sample to the mean in the training sample.
img.dist <- rdist(t(testzip[, test.sample]), t(img.mean))

## The classification results by the naive method
apply(img.dist, 1, which.min) - 1
```
]

---
# The naive method

```
#> V1 V2 V3 V4 V5 
#>  9  2  3  2  6
```

---
# Lab session

- What is the the success rate of this algorithm?

- Possible reason for the low success rate?

---
# Classification Using SVD Bases

- If we consider the images as 16 × 16 matrices, then the data are multidimensional.

-  Stacking all the columns of each image above each other gives a matrix.

- Now the idea is to “model” the variation within the set of training (and test) digits of one kind using an orthogonal basis of the subspace.

- An orthogonal basis can be computed using the SVD, and any matrix `$A$` is a sum of rank 1 matrices:
`$$A=\sum_{i=1}^{m} \sigma_{i} u_{i} v_{i}^{T}.$$`

---
# The SVD method

- We pick the digit 9 as an example in this method. Each column in A represents an image of a digit 9, and therefore the left singular vectors `$u_i$` are an orthogonal basis in the “image space of 9’s.”

- We will refer to the left singular vectors as “singular images”. Note that 
`$$a_{j}=\sum_{i=1}^{m}\left(\sigma_{i} v_{i j}\right) u_{i}.$$`
- Plot the first ten singular image from the SVD decomposition.

---
# The SVD method

```r
## Compute the singular matrix of a single digit in the training sample
digit <- 9

## Subtract the matrix for that digit
img.mat <- azip[, digit == dzip, drop = FALSE]
img.matSVD <- svd(img.mat)

## Plot the singular matrix under different basis.
par(mfrow = c(2, 5))
for (i in 1:10) {
    image(matrix(img.matSVD$u[, i], 16)[, 16:1], col = gray(255:0/255), main = paste("singular image ", i, sep = ""))
}

```

---
# The SVD method

---
# The SVD method

The SVD basis classification algorithm will be based on the following assumptions:

1. Each digit (in the training set and the test set) is well characterized by a few of the first singular images of its own kind. The more precise meaning of “few” should be investigated in experiments.

2. An expansion in terms of the first few singular images discriminates well between the different classes of digits.

3. If an unknown digit can be better approximated in one particular basis of singular images, the basis of 9’s say, than in the bases of the other classes, then it is likely that the unknown digit is a 9.

---
# The SVD method

- Thus we should compute how well an unknown digit can be represented in the 10 different bases. This can be done by computing the residual vector in least squares problems of the type
`$$\min _{\alpha_{i}}\left\|z-\sum_{i=1}^{k} \alpha_{i} u_{i}\right\|,$$`
where `$z$` represents an unknown digit and `$u_i$` represents the singular images.

- Using four bases yields the correct specification as follows. We also tries to classify other digits which gives robust results. But when we increase more basis function, there comes the risk of overfitting.

- It maybe not a good idea to use all the bases but one can always pick up the bases according to the first `$k$`th largest eigenvalues.

---
# An SVD basis classification algorithm

**Training:** For the training set of known digits, compute the SVD of each set of
digits of one kind.

**Classification:** For a given test digit, compute its relative residual in all 10 bases. If one residual is significantly smaller than all the others, classify as that. Otherwise give up.

---
# Testing based on SVD

```r
## Do the least square method with different basis and find the minimal residuals.

## The testing digit matrix
test.idx <- 2
image(matrix(testzip[, test.idx], 16)[, 16:1], col = gray(255:0/255), main = paste("Testing digit"))

resid.norm <- matrix(NA, 10, 1, dimnames = list(0:9, "resid"))
for (i in 0:9) {
    img.mat <- azip[, i == dzip, drop = FALSE]
    img.matSVD <- svd(img.mat)
    # basis.max <- ncol(img.matSVD$u)
    basis.max <- 4
    resid.norm[i + 1, ] <- norm(matrix(lm(testzip[, test.idx] ~ 0 + img.matSVD$u[, 1:basis.max])$resid), "F")
}
resid.norm
#>       resid
#> 0 11.815495
#> 1 12.536173
#> 2 11.388341
#> 3 12.706473
#> 4 12.141602
#> 5 12.926655
#> 6  9.455279
#> 7 12.577303
#> 8 12.061615
#> 9 12.551339
```

<img src="figure/unnamed-chunk-5-1.svg" width="432" style="display: block; margin: auto;" />
]

---
# The SVD method

.scroll-output[
- We will find out when we overfit (see the plot of classification success as a function of the number of basis vectors.)

- To see this, we loop over all testing observations and number of bases from 1 to 88, and then count the correct specification numbers.

<center>    
<img src="CorrectSpect.png" height="400px" />
</center>
]

---
# Lab: face recognition

- Collect each student's face images.

- Implement a classification/regression problem. Bear in mind the numerical linear algebra techniques.