Topic 2-06: Compare ORA and GSEA

We apply ORA and GSEA on the same p53 datasets and compare the results.

lt = readRDS(system.file("extdata", "p53_expr.rds", package = "GSEAtraining"))
expr = lt$expr
condition = lt$condition

We convert gene IDs to Entrez IDs in expr:

library(GSEAtraining)
expr = convert_to_entrez_id(expr)

##

##   gene id might be SYMBOL (p =  0.720 )

## 'select()' returned 1:many mapping between keys and columns

ORA

We need to perform t-test or similar DE test to extract DE genes:

We apply t-test to every gene.

p = apply(expr, 1, function(x) {
    x1 = x[condition == "WT"]
    x2 = x[condition == "MUT"]
    
    t.test(x1, x2)$p.value
})

Faster row-wise t-test:

library(genefilter)
tdf = rowttests(expr, factor(condition))  # the second must be a "factor"
tdf$fdr = p.adjust(tdf$p.value, "BH")
sum(tdf$fdr < 0.05)  # number of diff genes

## [1] 2

It seems there are not too many diff genes. We look at the distribution of the t-statistics:

plot(sort(tdf$statistic))

Instead of setting a cutoff for FDR, we set a cutoff for t-statistics, just to get enough number of diff genes.

sum(abs(tdf$statistic) > 2)

## [1] 478

First we perform ORA, using GO BP gene sets:

library(clusterProfiler)
library(org.Hs.eg.db)
l_sig = abs(tdf$statistic) > 2
tb_ora = enrichGO(gene  = rownames(expr)[l_sig],
                  OrgDb = org.Hs.eg.db,
                  ont   = "BP",
                  pAdjustMethod = "BH")
tb_ora = as.data.frame(tb_ora)  # only significant ones

Then we perform GSEA, using t-statistics as gene scores.

s = tdf$statistic
names(s) = rownames(tdf)   # s must have names (gene IDs)
s = sort(s, decreasing = TRUE)  # s must be pre-sorted
tb_gsea = gseGO(geneList = s, 
                OrgDb = org.Hs.eg.db,
                ont   = "BP",
                pAdjustMethod = "BH")

## using 'fgsea' for GSEA analysis, please cite Korotkevich et al (2019).

## preparing geneSet collections...

## GSEA analysis...

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : There were 2 pathways for which P-values were not calculated
## properly due to unbalanced (positive and negative) gene-level statistic values.
## For such pathways pval, padj, NES, log2err are set to NA. You can try to
## increase the value of the argument nPermSimple (for example set it nPermSimple
## = 10000)

## Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
## minSize, : For some of the pathways the P-values were likely overestimated. For
## such pathways log2err is set to NA.

## leading edge analysis...

## done...

tb_gsea = as.data.frame(tb_gsea)  # only significant ones

We compare the two significant GO lists:

library(eulerr)
plot(euler(list(ORA = tb_ora$ID, GSEA = tb_gsea$ID)), quantities = TRUE)

Zuguang Gu z.gu@dkfz.de

2025-03-02