Lecture 4: Unsupervised Methods



Yanfei Kang
yanfeikang@buaa.edu.cn

School of Economics and Management
Beihang University

In this lecture we are going to learn…



PCA

Why PCA?

PCA transforms the dataset from its original coordinate system to a new one. The first new axis points in the direction of greatest variance in the data. The second axis is orthogonal to the first and points in the direction of greatest remaining variance, and so on for the later axes. The first few axes often contain most of the variance, so we can drop the remaining axes and reduce the dimensionality of the data with little loss of information.

PCA is therefore often used for:

  1. Dimension reduction

  2. Data visualisation for multivariate data
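
The change of coordinates described above can be sketched on simulated data. This is a minimal illustration, not the lecture's own code: the two simulated variables and all object names (`toy`, `axes`, `scores`) are assumptions made for the example, and the new axes are taken as the eigenvectors of the sample covariance matrix.

```r
# Simulate two correlated variables (illustrative values only)
set.seed(42)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.4)
toy <- scale(cbind(x, y), center = TRUE, scale = FALSE)  # centre the data

# The new axes are the eigenvectors of the covariance matrix,
# ordered by the variance (eigenvalue) along each axis
eig <- eigen(cov(toy))
axes <- eig$vectors
var.explained <- eig$values / sum(eig$values)

# Coordinates of the data in the new system (the scores)
scores <- toy %*% axes

print(round(var.explained, 3))               # share of variance on each new axis
print(round(sum(axes[, 1] * axes[, 2]), 10)) # dot product of the axes: orthogonal
print(round(cov(scores), 3))                 # scores are uncorrelated
```

Keeping only `scores[, 1]` would reduce these two-dimensional data to one dimension while retaining the larger share of the variance.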

Case study: PCA

Check out the details about the dataset we will use (?BreastCancer).

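library(mlbench)
data("BreastCancer")
# Keep only the rows without missing values
breast.cancer.raw <- BreastCancer[complete.cases(BreastCancer), ]
# Drop the identifier and the class label; only the nine features enter the PCA
breast.cancer.data <- subset(breast.cancer.raw, select = -c(Id, Class))
# Convert the factor columns to their numeric codes, then centre and scale each feature
scaled.breast.cancer.data <- scale(sapply(breast.cancer.data, as.numeric))
# Fit the PCA and extract the scores on the first two components
breast.cancer.pc.cr <- princomp(scaled.breast.cancer.data)
breast.cancer.PC1 <- breast.cancer.pc.cr$scores[, 1]
breast.cancer.PC2 <- breast.cancer.pc.cr$scores[, 2]
summary(breast.cancer.pc.cr)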
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     2.4284638 0.87447849 0.73363286 0.67929244
## Proportion of Variance 0.6562315 0.08509266 0.05988959 0.05134609
## Cumulative Proportion  0.6562315 0.74132419 0.80121379 0.85255988
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     0.61642744 0.54969653 0.54234271 0.51036808
## Proportion of Variance 0.04228222 0.03362326 0.03272966 0.02898417
## Cumulative Proportion  0.89484209 0.92846535 0.96119501 0.99017918
##                             Comp.9
## Standard deviation     0.297082498
## Proportion of Variance 0.009820825
## Cumulative Proportion  1.000000000
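
The "Proportion of Variance" and "Cumulative Proportion" rows above follow directly from the standard deviations: each variance is a squared standard deviation, divided by the total. The sketch below reproduces them from the printed values. The object names and the 80% threshold are assumptions chosen for illustration.

```r
# Standard deviations copied from the summary output above
sdev <- c(2.4284638, 0.87447849, 0.73363286, 0.67929244, 0.61642744,
          0.54969653, 0.54234271, 0.51036808, 0.297082498)

variance <- sdev^2                      # variance along each component
prop.var <- variance / sum(variance)    # "Proportion of Variance"
cum.prop <- cumsum(prop.var)            # "Cumulative Proportion"

print(round(prop.var, 4))
print(round(cum.prop, 4))

# Smallest number of components whose cumulative proportion reaches a threshold
threshold <- 0.80                       # hypothetical cut-off
n.keep <- which(cum.prop >= threshold)[1]
print(n.keep)
```

With this threshold three components are kept, in line with the cumulative proportion of about 0.80 reported for Comp.3.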
breast.cancer.pc.cr$loadings
## 
## Loadings:
##                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Cl.thickness    -0.302 -0.148  0.865 -0.107         0.245         0.246
## Cell.size       -0.381                0.205  0.143  0.122 -0.216 -0.437
## Cell.shape      -0.377                0.175  0.106        -0.135 -0.583
## Marg.adhesion   -0.333        -0.413 -0.487         0.666         0.159
## Epith.c.size    -0.336  0.159         0.437  0.634         0.213  0.456
## Bare.nuclei     -0.335 -0.255        -0.503  0.127 -0.585  0.436 -0.124
## Bl.cromatin     -0.346 -0.227 -0.215        -0.231 -0.332 -0.681  0.390
## Normal.nucleoli -0.335        -0.134  0.413 -0.693         0.461       
## Mitoses         -0.233  0.907        -0.253 -0.102 -0.150 -0.123       
##                 Comp.9
## Cl.thickness          
## Cell.size        0.733
## Cell.shape      -0.667
## Marg.adhesion         
## Epith.c.size          
## Bare.nuclei           
## Bl.cromatin           
## Normal.nucleoli       
## Mitoses               
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889
##                Comp.9
## SS loadings     1.000
## Proportion Var  0.111
## Cumulative Var  1.000
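
A note on reading this output: blank entries are loadings that `print` suppresses because they are small in absolute value (below 0.1 by default), and the "SS loadings" row is 1 for every component because each loading vector has unit length. The "Proportion Var" row here is therefore just 1/9 per component and is not the variance explained; that is reported by `summary()`. The sketch below checks the unit-length property on the built-in `USArrests` data, which is swapped in only to keep the example self-contained.

```r
# USArrests is a stand-in data set (an assumption of this sketch)
pc <- princomp(scale(USArrests))

# Strip the "loadings" class to get a plain numeric matrix
L <- unclass(pc$loadings)

# Sum of squared loadings per component: this is the "SS loadings" row
ss <- colSums(L^2)
print(round(ss, 3))

# The loading vectors are also mutually orthogonal,
# so t(L) %*% L is the identity matrix
print(round(crossprod(L), 3))
```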
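library(ggplot2)
# Scatter plot of the observations on the first two principal components
qplot(breast.cancer.PC1, breast.cancer.PC2)

# The same plot, coloured by diagnosis (benign or malignant)
bc.class <- as.factor(breast.cancer.raw$Class)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = bc.class)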



Cluster analysis

Motivation

Clustering analysis

\(k\)-means clustering

Breast cancer data again

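library(ggfortify)
# k-means starts from random centres, so fix the seed for reproducibility
set.seed(1)
# Partition the scaled features into two clusters
km.breast.cancer <- kmeans(scaled.breast.cancer.data, 2)

# Visualise the clusters on the first two principal components
Cluster <- as.factor(km.breast.cancer$cluster)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = Cluster)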
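
When true labels are available, one way to judge a clustering is to cross-tabulate the cluster assignments against them. The sketch below does this on the built-in `iris` data, used here only as a self-contained stand-in. The same pattern should carry over to `km.breast.cancer$cluster` and `breast.cancer.raw$Class`. The choice of `nstart = 20` is an assumption; it reruns the algorithm from several random starts and keeps the best solution.

```r
# iris is a stand-in data set with known labels (an assumption of this sketch)
features <- scale(iris[, 1:4])

set.seed(1)
km <- kmeans(features, centers = 3, nstart = 20)

# Rows are clusters, columns are the known species
confusion <- table(Cluster = km$cluster, Species = iris$Species)
print(confusion)

# Share of the total sum of squares accounted for by the clustering
print(km$betweenss / km$totss)
```

Cluster numbers are arbitrary labels, so a good clustering shows up as one dominant cell per row rather than as a diagonal table.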

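Hierarchical clustering

Unlike \(k\)-means, hierarchical clustering does not require \(k\) to be specified before clustering. The tree is built once and can be cut into any number of clusters afterwards.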

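# Complete-linkage clustering (the hclust default) on Euclidean distances
hc.breast.cancer <- hclust(dist(scaled.breast.cancer.data))
# Cut the tree into two clusters
hc.cluster <- cutree(hc.breast.cancer, k = 2)
# Plot the dendrogram and outline the two clusters
plot(hc.breast.cancer, hang = -1, cex = 0.6)
rect.hclust(hc.breast.cancer, k = 2, border = "red")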
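
Because \(k\) is only needed when the tree is cut, a single fitted tree can give several clusterings without refitting. The sketch below illustrates this on the built-in `USArrests` data, which is swapped in to keep the example self-contained. The range of `k` and the cut height `h = 3` are arbitrary values chosen for illustration.

```r
# USArrests is a stand-in data set (an assumption of this sketch)
hc <- hclust(dist(scale(USArrests)))   # complete linkage by default

# Cut the same tree into 2, 3 and 4 clusters; one column per value of k
cuts <- cutree(hc, k = 2:4)
print(head(cuts))

# Number of distinct clusters in each column
n.clusters <- apply(cuts, 2, function(z) length(unique(z)))
print(n.clusters)

# The tree can also be cut at a height instead of a cluster count
by.height <- cutree(hc, h = 3)
print(table(by.height))
```

The clusterings are nested: every cluster obtained with `k = 3` lies entirely inside one of the clusters obtained with `k = 2`.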