Principal Components

Sometimes we need to analyze a data set containing a lot of variables. Visualizing and interpreting the correlations among more than three variables quickly becomes impractical as the number of variables grows.

Principal component analysis transforms the original variables into a new set of variables, each a linear combination of the originals, ordered by how much of the variation in the data set they capture. In a sense, the information from the variables with smaller variation is mapped onto the components that capture the most variation. Thus, these less varying variables can be dropped from the model as their own entities. The transformed variables are what we call principal components. These principal components attempt to reduce dimensionality while preserving the vital information within our data.
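To make the idea concrete, here is a minimal sketch on made-up data (the names `toy`, `toy.pc` and `toy.eig` and the seed are assumptions for illustration, not part of the leukaemia analysis). Two of the three variables are nearly copies of each other, so we would expect most of their shared variation to end up in a single component.

```r
# Toy data: two strongly correlated variables plus one unrelated variable
set.seed(1)                      # arbitrary seed, only for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)    # nearly a copy of x1
x3 <- rnorm(n)                   # independent of x1 and x2
toy <- cbind(x1, x2, x3)

toy.pc <- prcomp(toy, center = TRUE, scale. = TRUE)

# Each column of `rotation` holds the weights that define one component
print(toy.pc$rotation)

# Variance captured by each component, largest first
print(toy.pc$sdev^2)

# With scaling, these variances are the eigenvalues of the correlation matrix
toy.eig <- eigen(cor(toy))$values
print(all.equal(toy.pc$sdev^2, toy.eig))
```

The rotation matrix is where the "mapping" happens: every principal component is a weighted sum of all the original variables.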

In this notebook, we will find the first 2 principal components of a data set containing 7128 gene expressions for 72 patients with leukaemia.

The data set leukemia_big.csv contains two different leukaemia types, AML and ALL. Our goal is to find some pattern in the gene expressions that separates these two cancer types.

Start by loading the data:

In [1]:
data.leuk <- read.csv("leukemia_big.csv", header = TRUE)  # local file for quicker loading
# Transpose so that patients are rows and genes are columns
X <- t(data.leuk)

This data set needed to be transposed, since its original format had our variables as rows instead of columns.
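As a small sketch of what the transpose does, here is a made-up miniature of the layout (the values and the name `toy.leuk` are hypothetical). We assume the real file repeats the headers ALL and AML across columns. By default, read.csv makes duplicated column names unique by appending suffixes such as `.1`, which is why the labels below look the way they do.

```r
# Hypothetical miniature of the file: 4 genes in rows, 3 patients in columns
toy.leuk <- data.frame(ALL   = c(0.1,  1.2, -0.3, 0.7),
                       ALL.1 = c(0.4,  0.9, -0.1, 0.5),
                       AML   = c(1.5, -0.2,  0.8, 0.0))

toy.X <- t(toy.leuk)     # t() also converts the data frame to a matrix

print(dim(toy.leuk))     # genes x patients
print(dim(toy.X))        # patients x genes
print(rownames(toy.X))   # patient labels, now usable as row names
```

After transposing, each row is one patient, which is the orientation prcomp expects: observations in rows and variables in columns.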

Principal Component Analysis in R

We can now run a principal component analysis. We will only be interested in the first two components.

In [4]:
# Center and scale every gene before extracting the components
pc <- prcomp(X, scale. = TRUE, center = TRUE)
# summary(pc)
# pc.std <- prcomp(X.std, scale=TRUE, center=TRUE)
# summary(pc.std) # This is here for comparison (in case it's needed later)
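Before relying on the first two components, it can help to check how much of the total variance they carry. The sketch below uses simulated data rather than the leukaemia file (`sim.X`, `sim.pc`, `sim.var` and the seed are assumptions for illustration), with fewer samples than variables, as in our data set.

```r
set.seed(42)   # arbitrary seed, only for reproducibility
sim.X  <- matrix(rnorm(10 * 100), nrow = 10, ncol = 100)  # 10 samples, 100 variables
sim.pc <- prcomp(sim.X, center = TRUE, scale. = TRUE)

# With fewer samples than variables, we get one component per sample
print(dim(sim.pc$x))

# Cumulative share of the total variance, component by component
sim.var <- sim.pc$sdev^2 / sum(sim.pc$sdev^2)
print(round(cumsum(sim.var), 3))

# summary() reports the same quantities in its importance table
print(summary(sim.pc)$importance[, 1:2])
```

Applied to `pc`, the same two lines that compute `sim.var` would tell us how much information a two-dimensional plot keeps. Because the data are centered, the last component carries essentially no variance.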

We'd now like to plot the first two components against each other, to see if the different cancer types show any specific grouping. To do so, we mark our two cancer types with different colors on the plot.

In [3]:
# Separating cancer types
pc.all.1 <- pc$x[,1][grepl("ALL", names(pc$x[,1]))]
pc.all.2 <- pc$x[,2][grepl("ALL", names(pc$x[,2]))]
pc.aml.1 <- pc$x[,1][grepl("AML", names(pc$x[,1]))]
pc.aml.2 <- pc$x[,2][grepl("AML", names(pc$x[,2]))]
# There might be a better way to do this


# Plotting data with colored labels
plot(pc$x[,1], pc$x[,2], xlab='PC1', ylab='PC2', main='First two principal components')
points(pc.all.1, pc.all.2, col='red')
points(pc.aml.1, pc.aml.2, col='blue')
legend('topright', cex=0.7, col=c('red', 'blue'), pch=1, legend=c('ALL', 'AML'))
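On the comment that there might be a better way to do this: one possible alternative is to build a single vector of labels and let plot() color each point by its label. This is only a sketch on a made-up score matrix (`scores`, `type` and `type.colors` are hypothetical names). With the real data, `scores` would be `pc$x`.

```r
# Hypothetical stand-in for pc$x: 6 patients, 2 components, labelled rows
set.seed(7)   # arbitrary seed, only for reproducibility
scores <- matrix(rnorm(12), nrow = 6,
                 dimnames = list(c("ALL", "ALL.1", "ALL.2", "AML", "AML.1", "AML.2"),
                                 c("PC1", "PC2")))

# One label per patient, derived from the row names
type <- ifelse(grepl("ALL", rownames(scores)), "ALL", "AML")
type.colors <- c(ALL = "red", AML = "blue")

# A single plot() call colors every point according to its type
plot(scores[, "PC1"], scores[, "PC2"], col = type.colors[type],
     xlab = "PC1", ylab = "PC2", main = "First two principal components")
legend("topright", cex = 0.7, col = type.colors, pch = 1,
       legend = names(type.colors))
```

This avoids the four intermediate vectors and the overplotting, and adding more components to the analysis would not require any new subsetting code.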

As a very rough estimate, this plot suggests that we may be on to something. Although the first two components don't separate the two cancer types with 100% accuracy, we can see a general grouping of blue dots (AML) towards the top left and a gathering of red dots (ALL) towards the bottom right. A more thorough analysis is needed here. We could try looking at more components, or maybe try a lasso or ridge regression approach.