School of Economics and Management
Beihang University
http://yanfei.site

In this lecture we are going to learn…

  • Principal component analysis (PCA)
  • Cluster analysis

PCA

Why PCA?

In PCA, the dataset is transformed from its original coordinate system to a new one. The first new axis points in the direction of the greatest variance in the data. Each subsequent axis is orthogonal to the preceding axes and points in the direction of the largest remaining variance. Most of the variance is often captured by the first few axes, so we can discard the remaining axes and thereby reduce the dimensionality of the data.

Therefore, PCA is often used for:

  1. Dimension reduction

  2. Data visualisation for multivariate data
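
To make this concrete, here is a toy sketch of PCA by hand (the data and variable names are our own illustration): the new axes are the eigenvectors of the covariance matrix, and the eigenvalues are the variances along those axes.

set.seed(42)
x <- matrix(rnorm(200), ncol = 2) %*% matrix(c(2, 1, 0, 0.5), 2, 2)  # correlated toy data
x <- scale(x, center = TRUE, scale = FALSE)  # centre the data first
eig <- eigen(cov(x))          # $vectors: new axes; $values: variances, in decreasing order
scores <- x %*% eig$vectors   # the data in the new coordinate system
c(var(scores[, 1]), eig$values[1])  # variance along the first axis equals the first eigenvalue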

Case study: PCA

Check out the details about the dataset we will use (?BreastCancer).

library(mlbench)
data("BreastCancer")
# Keep only complete cases and drop the ID and class columns
breast.cancer.raw <- BreastCancer[complete.cases(BreastCancer), ]
breast.cancer.data <- subset(breast.cancer.raw, select = -c(Id, Class))
# Convert the factor columns to numeric codes and standardise each variable
scaled.breast.cancer.data <- scale(sapply(breast.cancer.data, as.numeric))
# PCA; keep the scores on the first two components for plotting
breast.cancer.pc.cr <- princomp(scaled.breast.cancer.data)
breast.cancer.PC1 <- breast.cancer.pc.cr$scores[, 1]
breast.cancer.PC2 <- breast.cancer.pc.cr$scores[, 2]
summary(breast.cancer.pc.cr)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     2.4284638 0.87447849 0.73363286 0.67929244
## Proportion of Variance 0.6562315 0.08509266 0.05988959 0.05134609
## Cumulative Proportion  0.6562315 0.74132419 0.80121379 0.85255988
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     0.61642744 0.54969653 0.54234271 0.51036808
## Proportion of Variance 0.04228222 0.03362326 0.03272966 0.02898417
## Cumulative Proportion  0.89484209 0.92846535 0.96119501 0.99017918
##                             Comp.9
## Standard deviation     0.297082498
## Proportion of Variance 0.009820825
## Cumulative Proportion  1.000000000
breast.cancer.pc.cr$loadings
## 
## Loadings:
##                 Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Cl.thickness     0.302  0.148  0.865  0.107         0.245         0.246
## Cell.size        0.381               -0.205 -0.143  0.122  0.216 -0.437
## Cell.shape       0.377               -0.175 -0.106         0.135 -0.583
## Marg.adhesion    0.333        -0.413  0.487         0.666         0.159
## Epith.c.size     0.336 -0.159        -0.437 -0.634        -0.213  0.456
## Bare.nuclei      0.335  0.255         0.503 -0.127 -0.585 -0.436 -0.124
## Bl.cromatin      0.346  0.227 -0.215         0.231 -0.332  0.681  0.390
## Normal.nucleoli  0.335        -0.134 -0.413  0.693        -0.461       
## Mitoses          0.233 -0.907         0.253  0.102 -0.150  0.123       
##                 Comp.9
## Cl.thickness          
## Cell.size        0.733
## Cell.shape      -0.667
## Marg.adhesion         
## Epith.c.size          
## Bare.nuclei           
## Bl.cromatin           
## Normal.nucleoli       
## Mitoses               
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889
##                Comp.9
## SS loadings     1.000
## Proportion Var  0.111
## Cumulative Var  1.000
library(ggplot2)
qplot(breast.cancer.PC1, breast.cancer.PC2)

bc.class <- as.factor(breast.cancer.raw$Class)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = bc.class)
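
The summary above shows that the first component alone carries about 66% of the variance. A scree plot makes this visual; screeplot() works directly on princomp objects.

screeplot(breast.cancer.pc.cr, type = "lines")  # variance explained by each component
cumsum(breast.cancer.pc.cr$sdev^2) / sum(breast.cancer.pc.cr$sdev^2)  # cumulative proportion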

Cluster analysis

Motivation

  • In cancer research, for classifying patients into subgroups according to their gene expression profiles. This can be useful for identifying the molecular profiles of patients with a good or bad prognosis, as well as for understanding the disease.

  • In marketing, for market segmentation: identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.

  • In city planning, for identifying groups of houses according to their type, value, and location.

Clustering approaches

  • Partitioning Clustering (\(k\)-means, PAM (partitioning around medoids), etc.)
    • E.g., \(k\)-means clustering aims to partition \(n\) observations into \(k\) clusters such that each observation belongs to the cluster with the nearest mean (see the objective after this list).
  • Hierarchical Clustering
    1. Agglomerative: This is a “bottom up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
    2. Divisive: This is a “top down” approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
  • Others, such as spectral clustering.
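
Formally, \(k\)-means with clusters \(S = \{S_1, \ldots, S_k\}\) and cluster means \(\mu_1, \ldots, \mu_k\) minimises the within-cluster sum of squares

\[
\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2 ,
\]

where \(\mu_i\) is the mean of the observations in \(S_i\). The standard algorithm alternates between assigning each observation to its nearest mean and recomputing the means until the assignments no longer change.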

\(k\)-means clustering

Breast cancer data again
library(ggfortify)
set.seed(1)  # k-means starts from random centres, so fix the seed for reproducibility
km.breast.cancer <- kmeans(scaled.breast.cancer.data, centers = 2)

## Visualise the clusters using the first two principal components
Cluster <- as.factor(km.breast.cancer$cluster)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = Cluster)
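
Since the true classes are known for this dataset, a cross-tabulation shows how well the two clusters line up with them (cluster labels are arbitrary, so only the pattern of agreement matters):

table(Cluster, bc.class)  # clusters vs. the known benign/malignant classes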

Hierarchical clustering

You do not need to specify \(k\) before clustering.

# Hierarchical clustering of Euclidean distances (complete linkage by default)
hc.breast.cancer <- hclust(dist(scaled.breast.cancer.data))
hc.cluster <- cutree(hc.breast.cancer, k = 2)  # cut the tree into two clusters
plot(hc.breast.cancer, hang = -1, cex = 0.6)   # dendrogram
rect.hclust(hc.breast.cancer, k = 2, border = "red")  # outline the two clusters
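
Note that hclust() uses complete linkage by default; other linkages, such as Ward's, can produce quite different trees and are worth comparing (hc.ward below is our own illustration). The two-cluster cut can also be cross-tabulated against the k-means result:

table(hc.cluster, km.breast.cancer$cluster)  # agreement with the k-means clusters
hc.ward <- hclust(dist(scaled.breast.cancer.data), method = "ward.D2")  # Ward's linkage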

A practical issue: picking the number of clusters

  • What value should \(k\) take? It's best to take advantage of domain knowledge.
  • In the absence of subject-matter knowledge, try a variety of heuristics, and perhaps a few different values of \(k\). For example:
    • The Calinski-Harabasz index of a clustering is the ratio of the between-cluster variance to the within-cluster variance, each scaled by its degrees of freedom: \(\mathrm{CH}(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}\), where \(n\) is the number of observations; larger values indicate better clusterings.
  • R offers various empirical approaches for selecting \(k\). One such tool for suggesting the best number of clusters is the NbClust package (a sketch follows).
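
As a minimal sketch (assuming NbClust is installed), the package computes a battery of indices, including Calinski-Harabasz, and reports the number of clusters that most of them favour:

library(NbClust)
nb <- NbClust(scaled.breast.cancer.data, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans")
nb$Best.nc  # the best number of clusters suggested by each index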

Clustering takeaways

  1. In a good clustering, points in the same cluster should be more similar (nearer) to each other than they are to points in other clusters.

  2. Ideally, you want a unit change in each coordinate to represent the same degree of change. One way to approximate this is to transform all the columns to have a mean value of 0 and a standard deviation of 1.0, for example by using the function scale().

  3. Clustering is often used for data exploration. But you may want to use the clusters that you discovered to categorize new data, as well.

  4. Different clustering algorithms will give different results. You should consider different approaches, with different numbers of clusters.

  5. There are many heuristics for estimating the best number of clusters. Again, you should consider the results from different heuristics and explore various numbers of clusters.
