7 min read

Speed up over-representation enrichment analysis

Over-representation analysis (ORA) evaluates whether a list of genes (e.g. differentially expressed genes, DE genes) are enriched in a gene set, normally with the hypergeometric distribution to calculate p-values. With the function phyper(), there are the following parameters:

phyper(x, m, n, k, lower.tail = FALSE)

where x is the number of DE genes in the gene set, m is the total number of genes in the gene set, n is the total number of genes not in the gene set, and k is the total number of DE genes.

phyper(..., lower.tail = FALSE) calculates \(\mathrm{Pr}(X > x)\). Normally we also include the case of \(X = x\) which gives the final p-value as \(\mathrm{Pr}(X \ge x)\). Then we need to slightly change the phyper() call by switching x to x-1:

phyper(x - 1, m, n, k, lower.tail = FALSE)

For a DE gene list and a gene set, it is easy to obtain values for these parameters. First I implement a function ora_single() that calculates p-value for a single gene set. As demonstrated as follows, ora_single() has three arguments, which correspond to the DE genes, gene set and the background genes. Sometimes the three types of genes may come from different sources, to make sure background genes always cover all DE genes and the gene set, we need to explicitly intersect the genes and gene_set vectors to the universe vector.

# assume `genes`, `gene_set` and `universe` have the same gene ID type
ora_single = function(genes, gene_set, universe) {
    n_universe = length(universe)

    genes = intersect(genes, universe)
    gene_set = intersect(gene_set, universe)

    x = length(intersect(genes, gene_set)) # DE genes in the gene set
    m = length(gene_set)  # total genes in the gene set
    n = n_universe - m    # total genes not in the gene set
    k = length(genes)     # total DE genes

    phyper(x - 1, m, n, k, lower.tail = FALSE)

Then, for a list of gene sets, we can apply ora_single() to each of them, with sapply() or in a for loop. In ora_v1(), I assume the gene sets are represented as a list of vectors where each vector is a gene set. ora_v1() returns a vector of p-values.

# version 1
ora_v1 = function(genes, gene_sets, universe) {
    sapply(gene_sets, function(x) ora_single(genes, x, universe))

To execute ora_v1(), I use GO gene sets in the BP namespace, I use all protein-coding genes as background, and I randomly generate 1000 genes as DE genes.

gs = as.list(org.Hs.egGO2ALLEGS)
gs = lapply(gs, unique)  # genes may duplicate
gs = gs[Ontology(names(gs)) == "BP"]  # only take the BP GO gene sets

pc_genes = select(org.Hs.eg.db, key = "protein-coding", 
    keytype = "GENETYPE", column = c("ENTREZID"))[, 2] # all protein-coding genes
genes = sample(pc_genes, 1000)

This random dataset is also a good example of showing the necessarity of applying intersection in ora_single(). Here pc_genes does not completely cover genes in gs.

length(setdiff(unlist(gs), pc_genes))
## [1] 1705

The reason is gs also comtains microRNAs.

Run ora_v1():

system.time(p1 <- ora_v1(genes, gs, pc_genes))
##    user  system elapsed 
##  11.054   2.011  13.070

It takes more than 10 seconds.

In version 1, we analyzed gene sets in sapply(), in a looping manner. In R, vectorization is always a good thing. Note phyper() also accepts vectors as input. In version 2, I generate the parameters x, m and n (k is the number of DE genes and is the same for all gene sets, so a scalar for k is enough) for phyper() all as vectors.

# version 2
ora_v2 = function(genes, gene_sets, universe) {
    genes = intersect(genes, universe)
    gene_sets = lapply(gene_sets, function(x) intersect(x, universe))

    n_universe = length(universe)
    n_genes = length(genes)
    x = sapply(gene_sets, function(x) length(intersect(x, genes)))
    m = sapply(gene_sets, length)
    n = n_universe - m
    k = n_genes
    phyper(x - 1, m, n, k, lower.tail = FALSE)

With the same genes, gs and pc_genes variables, run ora_v2().

system.time(p2 <- ora_v2(genes, gs, pc_genes))
##    user  system elapsed 
##   4.814   0.589   5.404

It speeds up more than 2 folds compared to ora_v1(), but still takes several seconds. Next we can perform code profiling on ora_v2():

p2 = ora_v2(genes, gs, pc_genes)
##                      self.time self.pct total.time total.pct
## "base::intersect"         4.84    93.08       5.06     97.31
## "duplicated.default"      0.14     2.69       0.14      2.69
## "intersect"               0.12     2.31       5.18     99.62
## "duplicated"              0.04     0.77       0.18      3.46
## "c"                       0.04     0.77       0.04      0.77
## "phyper"                  0.02     0.38       0.02      0.38

The Profiling result shows intersect() call is the most time-consuming part in the code, which uses more than 95% of all runtime. Not difficult to see, especially in this line (the second line in ora_v2()):

    gene_sets = lapply(gene_sets, function(x) intersect(x, universe))

for every gene set as x, we do intersection to universe which is a extremely long vector, The two vectors x and universe need to be fully scanned for every gene set.

A thought to optimize the code is if we can change universe to a hash table, then in every gene set, we don’t need to go through every element in universe, while we just check whether the current gene in the gene set exsits in the hash table.

Based on this idea, we can implement it with Cpp code. Here unordered_set is a hash table data structure.

sourceCpp(code = '
// [[Rcpp::plugins(cpp11)]]

#include <Rcpp.h>
#include <unordered_set>
using namespace Rcpp;

// [[Rcpp::export]]
List intersectToList(List lt, StringVector x) {

    int n = lt.size();
    List out(n);

    std::unordered_set<String> seen;
    seen.insert(x.begin(), x.end());

    for(int i = 0; i < n; i++) {
        StringVector v = as<StringVector>(lt[i]);
        LogicalVector l(v.size());

        std::unordered_set<String> seen2;

        for(int j = 0; j < v.size(); j ++) {
            l[j] = seen.find(v[j]) != seen.end() && seen2.insert(v[j]).second;

        out[i] = v[l];

    return out;

The new function intersectToList() can be called in R with two arguments. The first one is a list of vectors (lt) and the second one is a single vector (x) which will intersect to every vector in the list. In the Cpp code, I also removed duplicated elements in every vector in the list (lt).

Now we can switch the original line

    gene_sets = lapply(gene_sets, function(x) intersect(x, universe))

to the optimized code:

    gene_sets = intersectToList(gene_sets, universe)

Also I use intersectToList() when intersecting genes with gene sets. The two lines that have beem optimized are marked in the following code, which I name it ora_v3().

# version 3
ora_v3 = function(genes, gene_sets, universe) {

    gs_names = names(gene_sets)

    genes = intersect(genes, universe)
    gene_sets = intersectToList(gene_sets, universe)  # this line has been improved

    n_universe = length(universe)
    n_genes = length(genes)
    x = sapply(intersectToList(gene_sets, genes), length)  # this line has been improved
    m = sapply(gene_sets, length)
    n = n_universe - m
    k = n_genes
    p = phyper(x - 1, m, n, k, lower.tail = FALSE)
    names(p) = gs_names

Since in the Cpp implementaiton, names of the gene sets are lost, in the last three lines of ora_v3(), I manually add the gene set names to the returned object p.

Now with the same genes, gs and pc_genes variables, I run ora_v3().

system.time(p3 <- ora_v3(genes, gs, pc_genes))
##    user  system elapsed 
##   0.685   0.014   0.698

It finishes in less than one second!

The three p-values vectors (p1, p2 and p3) are identical:

identical(p1, p2)
## [1] TRUE
identical(p1, p3)
## [1] TRUE

If we compare the three versions of ORA functions, ora_v3() is around 20x faster than ora_v1() and around 8x faster than ora_v2().

    "loop(v1)"       = ora_v1(genes, gs, pc_genes),
    "vectorized(v2)" = ora_v2(genes, gs, pc_genes),
    "cpp(v3)"        = ora_v3(genes, gs, pc_genes),
    times = 10
## Unit: milliseconds
##            expr        min        lq       mean     median         uq       max neval
##        loop(v1) 12357.7165 12830.180 13181.9825 13127.5096 13233.0650 15122.301    10
##  vectorized(v2)  5356.1756  5395.429  5540.5684  5500.2627  5605.6915  5900.981    10
##         cpp(v3)   637.5977   681.095   952.9262   686.9258   688.6171  3402.574    10
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## locale:
## [1] C/UTF-8/C/C/C/C
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] microbenchmark_1.4.9 Rcpp_1.0.10          GO.db_3.16.0         org.Hs.eg.db_3.16.0 
##  [5] AnnotationDbi_1.60.0 IRanges_2.32.0       S4Vectors_0.36.1     Biobase_2.58.0      
##  [9] BiocGenerics_0.44.0  knitr_1.42          
## loaded via a namespace (and not attached):
##  [1] GenomeInfoDb_1.34.9    bslib_0.4.2            compiler_4.2.0         jquerylib_0.1.4       
##  [5] XVector_0.38.0         bitops_1.0-7           tools_4.2.0            zlibbioc_1.44.0       
##  [9] digest_0.6.31          bit_4.0.5              jsonlite_1.8.4         RSQLite_2.2.20        
## [13] evaluate_0.20          memoise_2.0.1          pkgconfig_2.0.3        png_0.1-8             
## [17] rlang_1.0.6            DBI_1.1.3              cli_3.6.0              yaml_2.3.7            
## [21] blogdown_1.16          xfun_0.37              fastmap_1.1.0          GenomeInfoDbData_1.2.9
## [25] httr_1.4.4             Biostrings_2.66.0      sass_0.4.5             vctrs_0.5.2           
## [29] bit64_4.0.5            R6_2.5.1               rmarkdown_2.20         bookdown_0.32         
## [33] blob_1.2.3             htmltools_0.5.4        KEGGREST_1.38.0        RCurl_1.98-1.10       
## [37] cachem_1.0.6           crayon_1.5.2