Topic 2-01: ORA on a single gene set
Zuguang Gu z.gu@dkfz.de
2025-05-31
Source:vignettes/topic2_01_ora_single_gene_set.Rmd
topic2_01_ora_single_gene_set.RmdExamples of ORA analysis
lt = readRDS(system.file("extdata", "ora.rds", package = "GSEAtopics"))
bg_gene = lt$bg_gene
diff_gene = lt$diff_gene
gene_set = lt$gene_setThe lengths of the gene lists:
length(bg_gene) # background genes## [1] 38592
length(diff_gene)## [1] 968
length(gene_set)## [1] 200
## [1] 14
A safer way is to make sure diff_gene and
gene_set are subset of bg_genes
We fill into the 2x2 contigency table:
| in the set | not in the set | total | |
|---|---|---|---|
| DE | |||
| not DE | |||
| total |
Then we fill all values into the 2x2 table:
| in the set | not in the set | total | |
|---|---|---|---|
| DE | 14 | 954 | 968 |
| not DE | 186 | 37438 | 37624 |
| total | 200 | 38392 | 38592 |
By hypergeometric distribution
All genes are separated into two distinct sets: in the gene set and not in the gene set:
1 - phyper(14-1, 200, 38392, 968)## [1] 0.0005686084
# or
phyper(14-1, 200, 38392, 968, lower.tail = FALSE)## [1] 0.0005686084
The separation can be performed in the other dimension: diff genes and non-diff genes:
1 - phyper(14-1, 968, 37624, 200)## [1] 0.0005686084
By Fisher’s Exact test
fisher.test(matrix(c(14, 186, 954, 37438), nrow = 2))##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(14, 186, 954, 37438), nrow = 2)
## p-value = 0.0005686
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.577790 5.106564
## sample estimates:
## odds ratio
## 2.953612
By Binomial distribution
p = probability of a gene being diff:
1 - pbinom(14-1, 200, 968/38592)## [1] 0.0005919725
Or p = probability of a gene being in the gene set:
1 - pbinom(14-1, 968, 200/38592)## [1] 0.0006924557
By Chi-square test
chisq.test(matrix(c(14, 186, 954, 37438), nrow = 2), correct = FALSE)##
## Pearson's Chi-squared test
##
## data: matrix(c(14, 186, 954, 37438), nrow = 2)
## X-squared = 16.587, df = 1, p-value = 4.647e-05
By two-sample z-test
p1 = 14/200
p2 = 954/38392
p = 968/38592
z = abs(p1 - p2)/sqrt(p*(1-p))/sqrt(1/200 + 1/38392)
2*pnorm(z, lower.tail = FALSE)## [1] 4.647219e-05
z^2## [1] 16.58685
We can also do in the other dimension, the tests are identical:
p1 = 14/968
p2 = 186/37624
p = 200/38592
z = abs(p1 - p2)/sqrt(p*(1-p))/sqrt(1/968 + 1/37624)
2*pnorm(z, lower.tail = FALSE)## [1] 4.647219e-05
z^2## [1] 16.58685
Test the speed of various tests
library(microbenchmark)
microbenchmark(
hyper = 1 - phyper(13, 200, 38392, 968),
fisher = fisher.test(matrix(c(14, 186, 954, 37438), nrow = 2)),
binom = 1 - pbinom(13, 968, 200/38592),
chisq = chisq.test(matrix(c(14, 186, 954, 37438), nrow = 2), correct = FALSE),
ztest = {
p1 = 14/200
p2 = 954/38392
p = 968/38592
z = abs(p1 - p2)/sqrt(p*(1-p))/sqrt(1/200 + 1/38392)
2*pnorm(z, lower.tail = FALSE)
},
times = 1000
)## Unit: nanoseconds
## expr min lq mean median uq max neval
## hyper 410 574.0 628.120 615 697 3362 1000
## fisher 195160 213302.5 240874.221 221113 230543 3158804 1000
## binom 369 492.0 593.229 615 697 2091 1000
## chisq 20910 22714.0 24394.959 24190 25584 45264 1000
## ztest 779 1025.0 1316.879 1394 1517 4551 1000
The effect of background size
1 - phyper(13, 200, 38392, 968)## [1] 0.0005686084
1 - phyper(13, 200, 38392 - 10000, 968)## [1] 0.008444096
1 - phyper(13, 200, 38392 - 20000, 968)## [1] 0.1606062
1 - phyper(13, 200, 38392 + 10000, 968)## [1] 5.473547e-05
1 - phyper(13, 200, 38392 + 20000, 968)## [1] 7.140953e-06