PCA in R

Principle component analysis (PCA) is a must-use exploratory anlaysis tools in informatic sciences. And usually can be helpful in identifying how samples are clustered and what features introduced most variance among samples. An article that I have read recently provided an excellent example of how PCA can be performed in excel. This would be mind blowing to anyone like me, who contempts the usage of excel.

In here, I would like to use R to demonstrate how pca can be done without using the standard prcomp function as well as to strengthen my understanding on PCA.

#load library
library(dplyr)
library(data.table)
library(ggplot2)

set.seed(10) #reporoducible

#simulated data
# center at mean
a = matrix(rnorm(40,1:10),8) %>%
	apply(2,function(x) x-mean(x))  #scale each columns
a
##            [,1]       [,2]      [,3]       [,4]       [,5]
## [1,] -3.1033267  2.1790325  1.209696 -1.5794728 -2.0270496
## [2,] -2.3063254  4.5492268  2.969489  0.3120637 -1.8106900
## [3,] -2.4934034 -3.0925153  5.090161  0.9981698 -2.3755596
## [4,] -0.7212405 -2.4385133  5.647618  1.8135664 -1.4311303
## [5,]  1.1724723 -2.4325284 -4.431671  3.5839642  1.3664712
## [6,]  2.2677215 -0.2068501 -5.020647  4.4319447  0.2452973
## [7,]  1.6698510  0.5470953 -2.510226 -6.1680152  2.6798400
## [8,]  3.5142512  0.8950525 -2.954421 -3.3922208  3.3528210
cov(a)
##           [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  6.191805 -1.0066783 -8.3605042 -1.0833179  5.1197227
## [2,] -1.006678  6.8591512 -0.7953364 -3.6538409 -0.1036576
## [3,] -8.360504 -0.7953364 18.2804297  0.2443067 -7.4176214
## [4,] -1.083318 -3.6538409  0.2443067 12.7022845 -3.4640454
## [5,]  5.119723 -0.1036576 -7.4176214 -3.4640454  5.0613302
eigen(cov(a)) # eigen vectors of the covariance matrix are the principle components
## $values
## [1] 26.6438538 14.6551549  6.2003003  1.4618961  0.1337959
## 
## $vectors
##             [,1]        [,2]       [,3]      [,4]        [,5]
## [1,]  0.43071650 -0.07326935 -0.2687093 0.6854321  0.51681041
## [2,]  0.03673689  0.39997155  0.8290116 0.3853576 -0.05396735
## [3,] -0.79308362  0.20952906 -0.2953572 0.4796553 -0.09905006
## [4,] -0.15629592 -0.88202683  0.2951085 0.2679626 -0.19674154
## [5,]  0.39965443  0.11305136 -0.2573134 0.2825051 -0.82551583
# eigen value is the variance that is explained by each 


eVal <- eigen(cov(a))$value
varianceExplained <- eVal/sum(eVal)
varianceExplained
## [1] 0.542699935 0.298506051 0.126291887 0.029776882 0.002725244
eVec <-  eigen(cov(a))$vector
rotation <- eVec %>%
	data.table %>%
	setnames(sapply(1:ncol(.),function(x) paste0('PC',x)))


# transformed data matrix
# this can alternatively done by a %*% eVec
PC1 <- rep(0,nrow(a))
PC2 <- rep(0,nrow(a))
for (i in 1:length(PC1)){
	PC1[i] <- sum(eVec[,1]*a[i,]) 
	PC2[i] <- sum(eVec[,2]*a[i,])
}
result <- data.table(PC1,PC2,
		method='manual',
		sample=1:length(PC1)) 
result
##          PC1       PC2 method sample
## 1: -2.779247  2.516373 manual      1
## 2: -3.953726  2.130789 manual      2
## 3: -6.329896 -1.136664 manual      3
## 4: -5.734678 -1.500556 manual      4
## 5:  3.916282 -4.994083 manual      5
## 6:  4.356278 -5.182223 manual      6
## 7:  4.765196  5.313823 manual      7
## 8:  5.759791  2.852542 manual      8
# using standard prcomp function 
pca <- prcomp(a)
summary(pca)$importance # variance explained
##                             PC1      PC2     PC3      PC4       PC5
## Standard deviation     5.161768 3.828205 2.49004 1.209089 0.3657812
## Proportion of Variance 0.542700 0.298510 0.12629 0.029780 0.0027300
## Cumulative Proportion  0.542700 0.841210 0.96750 0.997270 1.0000000
# combine result
resultTable <- pca$x %>% 
		data.table %>%
		select(PC1,PC2) %>%
		mutate(method = 'prcomp',
			sample = 1:nrow(.)) %>%
		rbind(result) 

# visualize in biplot
p1 <- ggplot() +
		geom_text(data=resultTable,
			aes(x=PC1,y=PC2,label=sample,
				color=factor(sample))) +
		facet_wrap(~method)+
		theme(legend.position = 'none')
p1

plot of chunk unnamed-chunk-1

#sometimes inverted PCs due to random assign of negative sign
#now add arrow for rotation
library(grid)
variables = rep(sapply(1:ncol(a), function(x) paste('Variable',x)))
rotationData <- data.table(rotation,method='manual',var = variables) %>%
		rbind(data.table(pca$rotation,method='prcomp',var = variables))

p1 + geom_segment(data = rotationData,
		aes(x = 0,y=0,xend = PC1*3, yend = PC2*3),
		color='red' ,
		arrow = arrow(length = unit(0.25, "cm"), 
				type = "closed",
				angle = 30)) +
	geom_text(data=rotationData,aes(x=PC1*3,y=PC2*3,label=var),color='red')

plot of chunk unnamed-chunk-1

Long story short, R is still better than Excel.




Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License. If you liked this post, you can share it with your followers or follow me on Twitter!