- Principal component analysis (PCA)
- Cluster analysis
School of Economics and Management
Beihang University
http://yanfei.site
In PCA, the dataset is transformed from its original coordinate system to a new coordinate system. The first new axis points in the direction of greatest variance in the data. The second axis is orthogonal to the first and points in the direction of greatest remaining variance, and so on for the later axes. Most of the variance is often contained in the first few axes, so we can drop the remaining axes and thereby reduce the dimensionality of the data.
Therefore, PCA is often used for:
Dimension reduction
Data visualisation for multivariate data
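The construction of the new axes can be sketched in base R. This is a minimal illustration on simulated two-dimensional data (the data and all variable names are assumptions, not part of the course example): it takes the principal axes as the eigenvectors of the covariance matrix and compares the result with `prcomp()`.

```r
# Simulated, correlated 2-D data (illustrative assumption, not the course dataset)
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
X <- scale(cbind(x1, x2), center = TRUE, scale = FALSE)  # centre each column

# Principal axes = eigenvectors of the covariance matrix,
# returned in order of decreasing eigenvalue (variance along that axis)
eig <- eigen(cov(X))
axes <- eig$vectors
variances <- eig$values

# Coordinates of the data in the new coordinate system
scores <- X %*% axes

# Share of the total variance carried by each new axis
prop.var <- variances / sum(variances)
print(round(prop.var, 3))

# prcomp() should give the same variances (axes may differ in sign)
pc <- prcomp(X)
print(all.equal(pc$sdev^2, variances))
```

Keeping only the first column of `scores` would reduce this toy dataset from two dimensions to one while retaining most of its variance.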
Check out the details of the dataset we will use (?BreastCancer).
library(mlbench)
data("BreastCancer")
# Keep only the rows without missing values
breast.cancer.raw = BreastCancer[complete.cases(BreastCancer),]
# Drop the identifier and the class label
breast.cancer.data = subset(breast.cancer.raw, select = -c(Id, Class))
# Convert the factor columns to numeric codes, then standardise each column
scaled.breast.cancer.data = scale(sapply(breast.cancer.data, as.numeric))
breast.cancer.pc.cr <- princomp(scaled.breast.cancer.data)
# Scores on the first two principal components
breast.cancer.PC1 <- breast.cancer.pc.cr$scores[, 1]
breast.cancer.PC2 <- breast.cancer.pc.cr$scores[, 2]
summary(breast.cancer.pc.cr)
## Importance of components:
##                           Comp.1     Comp.2     Comp.3     Comp.4
## Standard deviation     2.4284638 0.87447849 0.73363286 0.67929244
## Proportion of Variance 0.6562315 0.08509266 0.05988959 0.05134609
## Cumulative Proportion  0.6562315 0.74132419 0.80121379 0.85255988
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     0.61642744 0.54969653 0.54234271 0.51036808
## Proportion of Variance 0.04228222 0.03362326 0.03272966 0.02898417
## Cumulative Proportion  0.89484209 0.92846535 0.96119501 0.99017918
##                             Comp.9
## Standard deviation     0.297082498
## Proportion of Variance 0.009820825
## Cumulative Proportion  1.000000000
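The "Proportion of Variance" and "Cumulative Proportion" rows follow directly from the standard deviations. The sketch below recomputes them from the values printed above; the variable names are illustrative, and with the fitted object at hand the same vector should be available as `breast.cancer.pc.cr$sdev`.

```r
# Standard deviations copied from the summary() output above
sdev <- c(2.4284638, 0.87447849, 0.73363286, 0.67929244, 0.61642744,
          0.54969653, 0.54234271, 0.51036808, 0.297082498)

variances <- sdev^2                      # variance of each component
prop.var <- variances / sum(variances)   # "Proportion of Variance" row
cum.prop <- cumsum(prop.var)             # "Cumulative Proportion" row

print(round(rbind(prop.var, cum.prop), 4))

# Smallest number of components explaining at least 80% of the variance
n.keep <- which(cum.prop >= 0.80)[1]
print(n.keep)
```

The 80% threshold is an arbitrary choice for illustration; here it is reached with the first three components, in line with the cumulative proportions in the table.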
breast.cancer.pc.cr$loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Cl.thickness 0.302 0.148 0.865 0.107 0.245 0.246
## Cell.size 0.381 -0.205 -0.143 0.122 0.216 -0.437
## Cell.shape 0.377 -0.175 -0.106 0.135 -0.583
## Marg.adhesion 0.333 -0.413 0.487 0.666 0.159
## Epith.c.size 0.336 -0.159 -0.437 -0.634 -0.213 0.456
## Bare.nuclei 0.335 0.255 0.503 -0.127 -0.585 -0.436 -0.124
## Bl.cromatin 0.346 0.227 -0.215 0.231 -0.332 0.681 0.390
## Normal.nucleoli 0.335 -0.134 -0.413 0.693 -0.461
## Mitoses 0.233 -0.907 0.253 0.102 -0.150 0.123
##                 Comp.9
## Cl.thickness
## Cell.size        0.733
## Cell.shape      -0.667
## Marg.adhesion
## Epith.c.size
## Bare.nuclei
## Bl.cromatin
## Normal.nucleoli
## Mitoses
##
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.111  0.111  0.111  0.111  0.111  0.111  0.111  0.111
## Cumulative Var  0.111  0.222  0.333  0.444  0.556  0.667  0.778  0.889
##                Comp.9
## SS loadings     1.000
## Proportion Var  0.111
## Cumulative Var  1.000
library(ggplot2)
qplot(breast.cancer.PC1, breast.cancer.PC2)
bc.class <- as.factor(breast.cancer.raw$Class)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = bc.class)
Cluster analysis is used in many fields, for example:
In cancer research, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with a good or bad prognosis, as well as for understanding the disease.
In marketing, for market segmentation: identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.
In city planning, for identifying groups of houses according to their type, value and location.
library(ggfortify)
set.seed(1)
km.breast.cancer <- kmeans(scaled.breast.cancer.data, 2)
## visualizing clusters using pca
Cluster <- as.factor(km.breast.cancer$cluster)
qplot(breast.cancer.PC1, breast.cancer.PC2, col = Cluster)
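Unlike k-means, hierarchical clustering does not require you to specify \(k\) before clustering. You choose the number of clusters afterwards by cutting the tree.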
hc.breast.cancer <- hclust(dist(scaled.breast.cancer.data))
hc.cluster <- cutree(hc.breast.cancer, k = 2)
plot(hc.breast.cancer, hang = -1, cex = 0.6)
rect.hclust(hc.breast.cancer, k = 2, border = "red")
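Because the tree from `hclust()` is built without reference to a number of clusters, the same fitted object can be cut at several values of \(k\). The following is a small sketch on simulated data (the data and variable names are assumptions for illustration), using the fact that `cutree()` accepts a vector for `k`.

```r
# Simulated data (illustrative assumption, not the course dataset)
set.seed(1)
X <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 4), ncol = 2))

hc <- hclust(dist(X))   # the tree is built once, without choosing k

# Cut the same tree at k = 2, 3 and 4; the result has one column per k
cuts <- cutree(hc, k = 2:4)
print(head(cuts))

# Cluster sizes for every value of k
for (j in seq_len(ncol(cuts))) {
  print(table(cuts[, j]))
}
```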
In a good clustering, points in the same cluster should be more similar (nearer) to each other than they are to points in other clusters.
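One way to put a number on this idea is to compare the variation within clusters to the variation between them. The sketch below uses the sums of squares that `kmeans()` returns; the simulated data and variable names are assumptions for illustration.

```r
# Two simulated, well-separated groups (illustrative assumption)
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))

km <- kmeans(X, centers = 2, nstart = 10)

# Within-cluster sum of squares: how spread out points are inside their clusters
within <- km$tot.withinss
# Between-cluster sum of squares: how far apart the cluster centres are
between <- km$betweenss

# Share of the total variation explained by the clustering (closer to 1 is better)
print(between / km$totss)
```

A ratio close to 1 suggests that points sit much nearer to their own cluster centre than to the other clusters. It always increases with the number of clusters, so it is best used to compare clusterings with the same \(k\).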
Ideally, you want a unit change in each coordinate to represent the same degree of change. One way to approximate this is to transform all the columns to have a mean value of 0 and a standard deviation of 1.0, for example by using the function scale().
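As a quick check of what `scale()` does, the sketch below standardises two columns measured on very different scales. The toy data frame and its column names are assumptions for illustration.

```r
# Toy data with columns on very different scales (illustrative assumption)
set.seed(1)
df <- data.frame(height.cm = rnorm(50, mean = 170, sd = 10),
                 income = rnorm(50, mean = 50000, sd = 15000))

# Subtract each column's mean and divide by its standard deviation
scaled <- scale(df)

print(round(colMeans(scaled), 10))   # approximately 0 for every column
print(apply(scaled, 2, sd))          # 1 for every column

# The original means and standard deviations are stored as attributes,
# so the same transformation can be applied to other observations later
print(attr(scaled, "scaled:center"))
print(attr(scaled, "scaled:scale"))
```

Without this step, distances between rows would be dominated by the `income` column simply because its values are numerically larger.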
Clustering is often used for data exploration, but you may also want to use the clusters you discovered to categorize new data.
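For k-means, one simple way to categorize new data is to assign each new observation to the nearest cluster centre. The sketch below does this on simulated data; `assign.cluster()` is a hypothetical helper written for this illustration, not a function from any package, and it assumes the training data were standardised with `scale()`.

```r
# Simulated training data with two groups (illustrative assumption)
set.seed(1)
train <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
               matrix(rnorm(100, mean = 5), ncol = 2))
train.scaled <- scale(train)
km <- kmeans(train.scaled, centers = 2, nstart = 10)

# Hypothetical helper: assign each new row to the nearest cluster centre
assign.cluster <- function(newdata, scaled.train, centers) {
  # Apply the *training* centring and scaling to the new observations
  z <- scale(newdata,
             center = attr(scaled.train, "scaled:center"),
             scale = attr(scaled.train, "scaled:scale"))
  # Squared Euclidean distance from every new row to every centre
  d <- apply(centers, 1, function(ctr) rowSums(sweep(z, 2, ctr)^2))
  d <- matrix(d, nrow = nrow(z))  # keep a matrix even for a single new row
  # Index of the closest centre for each new row
  max.col(-d, ties.method = "first")
}

new.points <- rbind(c(0.2, -0.1), c(4.8, 5.3))
new.cluster <- assign.cluster(new.points, train.scaled, km$centers)
print(new.cluster)
```

The important design choice is to reuse the means and standard deviations of the training data rather than rescaling the new observations on their own, so that old and new points live in the same coordinate system.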
Different clustering algorithms will give different results. You should consider different approaches, with different numbers of clusters.
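A cross-tabulation is a simple way to see how far two algorithms agree. The sketch below compares k-means with hierarchical clustering for several numbers of clusters on simulated data (the data and variable names are assumptions for illustration).

```r
# Simulated data with three overlapping groups (illustrative assumption)
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 2), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

for (k in 2:4) {
  km.cluster <- kmeans(X, centers = k, nstart = 10)$cluster
  hc.cluster <- cutree(hclust(dist(X)), k = k)
  # Rows: k-means clusters; columns: hierarchical clusters.
  # Cluster labels are arbitrary, so look at how the counts spread across cells.
  cat("k =", k, "\n")
  print(table(kmeans = km.cluster, hclust = hc.cluster))
}
```

If the two methods agreed perfectly, each row of a table would have a single non-zero cell; counts spread over several cells indicate observations on which the methods disagree.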
There are many heuristics for estimating the best number of clusters. Again, you should consider the results from different heuristics and explore various numbers of clusters.
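One common heuristic is the "elbow" plot: run k-means for a range of \(k\) and plot the total within-cluster sum of squares. The sketch below is one possible version on simulated data; the data, the range of \(k\) and the variable names are assumptions for illustration.

```r
# Simulated data with three groups (illustrative assumption)
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2),
           matrix(rnorm(100, mean = 10), ncol = 2))

# Total within-cluster sum of squares for k = 1, ..., 6
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 20)$tot.withinss)

# Look for the "elbow": the k after which adding clusters helps much less
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The elbow is judged by eye and is not always clear-cut, which is one reason to compare it with other heuristics before settling on a number of clusters.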