School of Economics and Management
Beihang University
http://yanfei.site

In this lecture we are going to learn…

  • Principal component analysis (PCA)
  • Cluster analysis

PCA

Why PCA?

In PCA, the dataset is transformed from its original coordinate system to a new coordinate system. The first new axis points in the direction of the greatest variance in the data. The second axis is orthogonal to the first and points in the direction of the greatest remaining variance, and so on for the later axes. Because the majority of the variance is often captured by the first few axes, we can ignore the remaining ones and thereby reduce the dimensionality of the data.

Therefore, PCA is often used for:

  1. Dimension reduction

  2. Data visualisation for multivariate data
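
To make this concrete, below is a minimal sketch (not part of the case study that follows) showing that the new axes are the eigenvectors of the covariance matrix of the centred data, and that projecting onto them reproduces the scores returned by princomp(); the toy data and object names are illustrative only.

## Toy illustration: PCA as an eigendecomposition of the covariance matrix
set.seed(123)
x <- matrix(rnorm(200), ncol = 2) %*% matrix(c(3, 1, 1, 1), 2, 2)  # correlated 2-d data
x.centred <- scale(x, center = TRUE, scale = FALSE)                # centre each column
eig <- eigen(cov(x.centred))          # eigenvectors = new axes, eigenvalues = variances
scores <- x.centred %*% eig$vectors   # coordinates in the new (rotated) system
pc <- princomp(x)                     # princomp() gives the same scores up to sign
head(scores)
head(pc$scores)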

Case study: PCA

Check out the details about the dataset we will use (?BreastCancer).

library(mlbench)
data("BreastCancer")
## Keep complete cases only, and drop the Id and Class columns before PCA
breast.cancer.raw <- BreastCancer[complete.cases(BreastCancer), ]
breast.cancer.data <- subset(breast.cancer.raw, select = -c(Id, Class))
## Convert the factor columns to numeric and standardise each variable
scaled.breast.cancer.data <- scale(sapply(breast.cancer.data, as.numeric))
## Principal component analysis; keep the scores on the first two components
breast.cancer.pc.cr <- princomp(scaled.breast.cancer.data)
breast.cancer.PC1 <- breast.cancer.pc.cr$scores[, 1]
breast.cancer.PC2 <- breast.cancer.pc.cr$scores[, 2]
summary(breast.cancer.pc.cr)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     2.4284638 0.87447849 0.73363286 0.67929244
## Proportion of Variance 0.6562315 0.08509266 0.05988959 0.05134609
## Cumulative Proportion  0.6562315 0.74132419 0.80121379 0.85255988
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     0.61642744 0.54969653 0.54234271 0.51036808
## Proportion of Variance 0.04228222 0.03362326 0.03272966 0.02898417
## Cumulative Proportion  0.89484209 0.92846535 0.96119501 0.99017918
##                             Comp.9
## Standard deviation     0.297082498
## Proportion of Variance 0.009820825
## Cumulative Proportion  1.000000000
breast.cancer.pc.cr$loadings
## 
## Loadings:
##                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Cl.thickness     0.302  0.148  0.865  0.107         0.245         0.246
## Cell.size        0.381               -0.205 -0.143  0.122  0.216 -0.437
## Cell.shape       0.377               -0.175 -0.106         0.135 -0.583
## Marg.adhesion    0.333        -0.413  0.487         0.666         0.159
## Epith.c.size     0.336 -0.159        -0.437 -0.634        -0.213  0.456
## Bare.nuclei      0.335  0.255         0.503 -0.127 -0.585 -0.436 -0.124
## Bl.cromatin      0.346  0.227 -0.215         0.231 -0.332  0.681  0.390
## Normal.nucleoli  0.335        -0.134 -0.413  0.693        -0.461       
## Mitoses          0.233 -0.907         0.253  0.102 -0.150  0.123       
##                 Comp.9
## Cl.thickness          
## Cell.size        0.733
## Cell.shape      -0.667
## Marg.adhesion         
## Epith.c.size          
## Bare.nuclei           
## Bl.cromatin           
## Normal.nucleoli       
## Mitoses               
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889
##                Comp.9
## SS loadings     1.000
## Proportion Var  0.111
## Cumulative Var  1.000
library(ggplot2)
## Scatter plot of the first two principal component scores
qplot(breast.cancer.PC1, breast.cancer.PC2)

## Colour the points by the true tumour class (benign vs malignant)
bc.class <- as.factor(breast.cancer.raw$Class)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = bc.class)
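
The cumulative proportions reported by summary() above show that the first component alone explains about 66% of the variance. As an optional extra (not in the original handout), a scree plot gives the same information graphically, directly from the princomp object:

## Scree plot: variance explained by each principal component
screeplot(breast.cancer.pc.cr, type = "lines", main = "Breast cancer PCA")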

Cluster analysis

Motivation

  • In cancer research, for classifying patients into subgroups according to their gene expression profiles. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.

  • In marketing, for market segmentation: identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.

  • In city planning, for identifying groups of houses according to their type, value and location.

Clustering analysis

  • Partitioning Clustering (\(k\)-means, partitioning around medoids (PAM), etc.)
    • E.g., \(k\)-means clustering aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean.
  • Hierarchical Clustering (a minimal sketch of the agglomerative approach is given after this list)
    1. Agglomerative: This is a “bottom up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
    2. Divisive: This is a “top down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
  • Other methods, such as spectral clustering.
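
As a complement to the \(k\)-means example in the next section, here is a minimal sketch of agglomerative hierarchical clustering on the same data; it assumes the scaled.breast.cancer.data object created in the PCA case study, and the Euclidean distance / complete linkage choices are illustrative rather than prescribed.

## Agglomerative hierarchical clustering on the scaled breast cancer data
bc.dist <- dist(scaled.breast.cancer.data)          # Euclidean distances between observations
bc.hclust <- hclust(bc.dist, method = "complete")   # merge clusters bottom-up (complete linkage)
plot(bc.hclust, labels = FALSE)                     # dendrogram of the merge hierarchy
hc.clusters <- cutree(bc.hclust, k = 2)             # cut the tree to obtain two clusters
table(hc.clusters)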

\(k\)-means clustering
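
Formally, \(k\)-means chooses cluster sets \(S_1, \dots, S_k\) to minimise the total within-cluster sum of squares,
\[
\min_{S_1, \dots, S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\]
where \(\mu_i\) is the mean of the observations in cluster \(S_i\).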

Breast cancer data again
library(ggfortify)
set.seed(1)
## k-means with two clusters on the standardised data
km.breast.cancer <- kmeans(scaled.breast.cancer.data, 2)

## visualising the clusters on the first two principal components
Cluster <- as.factor(km.breast.cancer$cluster)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = Cluster)
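
Because the true diagnoses are known for this dataset, a simple cross-tabulation (an extra check, reusing the bc.class factor defined earlier) shows how closely the two \(k\)-means clusters match the benign/malignant labels:

## Cross-tabulate the k-means clusters against the true tumour classes
table(Cluster, bc.class)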