Skip to contents

ORA extension

We start with a differential gene list and an universe gene list. Let’s first convert them to EntreZ IDs:

library(CePa)
data(gene.list)
genes = gene.list$dif
universe = gene.list$bk
library(GSEAtopics)
library(org.Hs.eg.db)
genes = convert_to_entrez(genes)
##   gene id might be SYMBOL (p =  0.860 )
universe = convert_to_entrez(universe)
##   gene id might be SYMBOL (p =  0.890 )

The most important part in topology-based pathway enrichment analysis is that we need pathways in the network representations. The KEGGgraph can read network data of KEGG pathways and convert it to proper format.

library(KEGGgraph)
df = parseKGML2DataFrame(system.file("extdata/hsa04010.xml", package = "KEGGgraph"))
head(df)
##       from        to  type    subtype
## 1 hsa:5923 hsa:22800 PPrel activation
## 2 hsa:5923 hsa:22808 PPrel activation
## 3 hsa:5923  hsa:3265 PPrel activation
## 4 hsa:5923  hsa:3845 PPrel activation
## 5 hsa:5923  hsa:4893 PPrel activation
## 6 hsa:5923  hsa:6237 PPrel activation
df[, 1] = gsub("^hsa:", "", df[, 1])
df[, 2] = gsub("^hsa:", "", df[, 2])

We only need the first two columns (from and to) to construct the network.

## IGRAPH 6873575 DN-- 262 1029 -- 
## + attr: name (v/c)
## + edges from 6873575 (vertex names):
##  [1] 5923 ->22800 5923 ->22808 5923 ->3265  5923 ->3845  5923 ->4893 
##  [6] 5923 ->6237  5924 ->22800 5924 ->22808 5924 ->3265  5924 ->3845 
## [11] 5924 ->4893  5924 ->6237  11072->5594  11072->5594  11072->5595 
## [16] 11072->5595  11072->5599  11072->5599  11072->5601  11072->5601 
## [21] 11072->5602  11072->5602  11072->1432  11072->1432  11072->5600 
## [26] 11072->5600  11072->5603  11072->5603  11072->6300  11072->6300 
## [31] 11221->5594  11221->5594  11221->5595  11221->5595  11221->5599 
## [36] 11221->5599  11221->5601  11221->5601  11221->5602  11221->5602 
## + ... omitted several edges

The igraph package provides many different centrality methods. Here we take degree() as an example:

d = degree(g)

Make sure all genes in d are within universe.

d = d[intersect(names(d), universe)]

The statistic for the gene set is inI(Gi)di\sum_i^n I(G_i) \cdot d_i

cn = intersect(genes, names(d))
s = sum(d[cn])

or

s = sum( (names(d) %in% genes)*d )

In the random permutation, we randomly pick length(genes) as differential genes.

s_random = numeric(1000)
for(i in 1:1000) {
    genes_random = sample(universe, length(genes))
    cn = intersect(genes_random, names(d))
    s_random[i] = sum(d[cn])
}

P-value is calculated in the normal way.

hist(s_random)
abline(v = s, col = 2)

sum(s_random > s)/length(s_random)
## [1] 0.023

GSEA-extension

We only consider the gene permutation process. We generate a vector of gene scores.

score = structure(rnorm(length(universe)), names = universe)

As d has already been restricted within universe, the set-level statistic is calculated as 1nin|si|di\frac{1}{n}\sum_i^n |s_i|\cdot d_i

s = mean(abs(score[names(d)])*d)

In the gene permutation, the differential scores are randomly assigned to genes.

s_random = numeric(1000)
for(i in 1:1000) {
    score_random = sample(score, length(d))
    s_random[i] = mean(abs(score_random)*d)
}
hist(s_random)
abline(v = s, col = 2)

sum(s_random > s)/length(s_random)
## [1] 0.504

Question: can you do a two-sided test?

Helper functions

Two helper funtions are provided:

ora_kegg_topology(genes, universe, centrality = igraph::degree, verbose = FALSE) |> head()
##    pathway  k    s n_gs    log2fe  z_score p_value   p_adjust
## 1 hsa03460 16   82   31 2.4471071 6.523777   0.000 0.00000000
## 2 hsa04010 51 1085  285 1.0874550 4.851782   0.000 0.00000000
## 3 hsa04114 23  490  108 1.5614595 5.167157   0.000 0.00000000
## 4 hsa04218 28  251  139 0.9693699 3.345670   0.000 0.00000000
## 5 hsa00071 13  122   40 1.6274315 4.223759   0.001 0.02833333
## 6 hsa04014 35  812  214 1.1750747 4.572617   0.001 0.02833333
##                                     description
## 1 Fanconi anemia pathway - Homo sapiens (human)
## 2 MAPK signaling pathway - Homo sapiens (human)
## 3         Oocyte meiosis - Homo sapiens (human)
## 4    Cellular senescence - Homo sapiens (human)
## 5 Fatty acid degradation - Homo sapiens (human)
## 6  Ras signaling pathway - Homo sapiens (human)
gsea_kegg_topology(score, centrality = igraph::degree, verbose = FALSE) |> head()
##    pathway        s n_gs    log2fe  z_score p_value p_adjust
## 1 hsa04260 3.036433   21 0.6631036 2.537344   0.011   0.7512
## 2 hsa00531 3.506256   19 0.5383367 2.189644   0.014   0.7512
## 3 hsa04918 5.301637   42 0.4324909 2.146144   0.021   0.7512
## 4 hsa04742 4.479651   19 0.5837811 2.250487   0.022   0.7512
## 5 hsa05332 7.069314   20 0.4858661 2.084989   0.027   0.7512
## 6 hsa00010 7.650973   63 0.2563581 1.971894   0.028   0.7512
##                                            description
## 1    Cardiac muscle contraction - Homo sapiens (human)
## 2 Glycosaminoglycan degradation - Homo sapiens (human)
## 3     Thyroid hormone synthesis - Homo sapiens (human)
## 4            Taste transduction - Homo sapiens (human)
## 5     Graft-versus-host disease - Homo sapiens (human)
## 6  Glycolysis / Gluconeogenesis - Homo sapiens (human)

CePa

The CePa package has not been updated for ~10 years.

It will try six different centralities:

  • equal.weight
  • in.degree
  • out.degree
  • node betweenness
  • in.reach
  • out.reach

dif and bk are two gene vectors, pc is in a special format.

# around 5min
res = cepa.ora.all(dif = gene.list$dif, bk = gene.list$bk, pc = PID.db$NCI)

Heatmap of adjusted p-values:

plot(res, adj.method = "BH", only.sig = TRUE)

Advantages:

  1. use multiple centrality metrics,
  2. node-level instead of gene-level
  3. also include non-gene nodes.