What is the point of gene set or pathway enrichment analysis?

BY IN Notes, R, Tutorials NO COMMENTS YET ,

Gene set or pathway enrichment analysis is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states. There are many different ways to perform gene set or pathway enrichment analysis, from hypergeometric distribution tests to bootstrap based approaches. Gene set or pathway enrichment analysis can be particularly powerful when genes individually do not exhibit a statistically significant difference between two biological states, but when grouped together, show statistically significant concordant differences.

Here, we explore such a scenario through simulation using the Lightweight Iterative Gene set Enrichment in R (liger) package. To install liger, you can use devtools:

require(devtools)
devtools::install_github("JEFworks/liger")

To begin, we will simulate a weak differential expression within a known gene set between two biological samples.

set.seed(0)
library(liger)
# load gene set
data("org.Hs.GO2Symbol.list")  
# get universe
universe <- unique(unlist(org.Hs.GO2Symbol.list))
# get genes in a gene set
gs <- org.Hs.GO2Symbol.list[[1]]
# gene set name
names(org.Hs.GO2Symbol.list)[1] 
# make random data
Nsamples <- 100
Mgenes <- length(universe)
mat <- matrix(rnorm(Nsamples * Mgenes, 5, 10), Mgenes, Nsamples)
colnames(mat) <- paste0('sample', 1:Nsamples)
rownames(mat) <- universe
# let half the samples be in one biological state and the other half a different one
# simulate differential expression for genes in the gene set
mat[gs, 1:round(Nsamples/2)] <- rnorm(length(gs)*round(Nsamples/2), 10, 10)
# we can visualize this weak differential expression in a heatmap
# visualize weakly differentially expressed genes and another 50 genes
vi <- c(gs, universe[1:50]) 
# label supposedly differentially expressed genes 
heatmap(mat[vi,], Rowv=NA, Colv=NA, scale="none", 
        col=colorRampPalette(c("blue", "white", "red"))(100),
        RowSideColors = rainbow(2)[as.factor(vi %in% gs)]) 
sim-data-1

Even visually, it's somewhat difficult to tell which genes are supposedly differentially expressed. We can also quantify the extent of the differential expression between our two biological states using a T-test (or other statistical tests for assessing the significance of differential expression).

# run differential expression analysis
vals <- sapply(1:nrow(mat), function(i) {
  pv <- t.test(
    mat[i, 1:round(Nsamples/2)], 
    mat[i, round(Nsamples/2+1):Nsamples]
  )$p.val
  return(pv)
})
names(vals) <- rownames(mat)
vals <- -log10(vals)
# look at -log10(p values) for genes that are supposedly differentially expressed
barplot(sort(vals[gs], decreasing=TRUE), ylim=c(0, 10))
# multiple testing significance correction line
bonf <- function(a, n) { 1 - (1-a) ** (1/n) }
abline(h = -log10(bonf(0.05, nrow(mat))), col="red")
sim-diffexp-1

After multiple testing correction, none of the genes, including those we simulated to be differentially expressed, were actually picked up as significantly differentially expressed. In a real world situation, we may be tempted to end our analysis here and conclude that since nothing is significantly differentially expressed between the two biological states there is no significant difference.

However, we can perform gene set or pathway enrichment analysis, for example, on 100 different a priori defined gene sets to look for statistically significant concordant differences.

# run iterative bulk gsea
gseaVals <- iterative.bulk.gsea(values = vals, set.list = org.Hs.GO2Symbol.list[1:100])
# identify significant gene sets
gseaSig <- rownames(gseaVals[gseaVals$q.val < 0.05 & gseaVals$sscore > 0,])
gseaSig
# look at plots
for(i in seq_along(gseaSig)) {
  gs <- org.Hs.GO2Symbol.list[[gseaSig[i]]]
  gsea(values=vals, geneset=gs, mc.cores=1, plot=TRUE)
}
sim-gsea-1

Despite no individual gene being statistically significantly differentially expressed between our two biological states, enrichment analysis identifies a significantly enriched gene set, GO:0000002, which is exactly the gene set that we simulated to show concordant differences. Therefore, by pooling genes within these a priori defined gene sets, we are able to increase our statistical power to identify differences between our two biological states.

So, what do you think ?