vignettes/topic2_01_ora_single_gene_set.Rmd
topic2_01_ora_single_gene_set.Rmd
lt = readRDS(system.file("extdata", "ora.rds", package = "GSEAtraining"))
bg_gene = lt$bg_gene
diff_gene = lt$diff_gene
gene_set = lt$gene_set
The lengths of the gene lists:
length(bg_gene) # background genes
## [1] 38592
length(diff_gene)
## [1] 968
length(gene_set)
## [1] 200
## [1] 14
A safer way is to make sure diff_gene
and gene_set
are subset of bg_genes
We fill into the 2x2 contigency table:
in the set | not in the set | total | |
---|---|---|---|
DE | \(n_{11}\) | \(n_{12}\) | \(n_{1+}\) |
not DE | \(n_{21}\) | \(n_{22}\) | \(n_{2+}\) |
total | \(n_{+1}\) | \(n_{+2}\) | \(n\) |
Then we fill all values into the 2x2 table:
in the set | not in the set | total | |
---|---|---|---|
DE | 14 | 954 | 968 |
not DE | 186 | 37438 | 37624 |
total | 200 | 38392 | 38592 |
All genes are separated into two distinct sets: in the gene set and not in the gene set:
1 - phyper(14-1, 200, 38392, 968)
## [1] 0.0005686084
# or
phyper(14-1, 200, 38392, 968, lower.tail = FALSE)
## [1] 0.0005686084
The separation can be performed in the other dimension: diff genes and non-diff genes:
1 - phyper(14-1, 968, 37624, 200)
## [1] 0.0005686084
fisher.test(matrix(c(14, 186, 954, 37438), nrow = 2))
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(14, 186, 954, 37438), nrow = 2)
## p-value = 0.0005686
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.577790 5.106564
## sample estimates:
## odds ratio
## 2.953612
p = probability of a gene being diff:
1 - pbinom(14-1, 200, 968/38592)
## [1] 0.0005919725
Or p = probability of a gene being in the gene set:
1 - pbinom(14-1, 968, 200/38592)
## [1] 0.0006924557
chisq.test(matrix(c(14, 186, 954, 37438), nrow = 2), correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: matrix(c(14, 186, 954, 37438), nrow = 2)
## X-squared = 16.587, df = 1, p-value = 4.647e-05
p1 = 14/200
p2 = 954/38392
p = 968/38592
z = abs(p1 - p2)/sqrt(p*(1-p))/sqrt(1/200 + 1/38392)
2*pnorm(z, lower.tail = FALSE)
## [1] 4.647219e-05
z^2
## [1] 16.58685
We can also do in the other dimension, the tests are identical:
p1 = 14/968
p2 = 186/37624
p = 200/38592
z = abs(p1 - p2)/sqrt(p*(1-p))/sqrt(1/968 + 1/37624)
2*pnorm(z, lower.tail = FALSE)
## [1] 4.647219e-05
z^2
## [1] 16.58685
library(microbenchmark)
microbenchmark(
hyper = 1 - phyper(13, 200, 38392, 968),
fisher = fisher.test(matrix(c(14, 186, 954, 37438), nrow = 2)),
binom = 1 - pbinom(13, 968, 200/38592),
chisq = chisq.test(matrix(c(14, 186, 954, 37438), nrow = 2), correct = FALSE),
ztest = {
p1 = 14/200
p2 = 954/38392
p = 968/38592
z = abs(p1 - p2)/sqrt(p*(1-p))/sqrt(1/200 + 1/38392)
2*pnorm(z, lower.tail = FALSE)
},
times = 1000
)
## Unit: microseconds
## expr min lq mean median uq max neval
## hyper 1.152 1.5795 2.425242 2.4425 2.9880 12.869 1000
## fisher 625.561 677.7635 797.924599 719.4990 789.4285 10021.422 1000
## binom 1.026 1.4420 2.611220 2.4550 3.2280 30.534 1000
## chisq 70.502 83.2065 102.446801 93.2885 103.2565 6639.108 1000
## ztest 2.341 3.3035 5.289895 5.0715 6.2160 49.912 1000
1 - phyper(13, 200, 38392, 968)
## [1] 0.0005686084
1 - phyper(13, 200, 38392 - 10000, 968)
## [1] 0.008444096
1 - phyper(13, 200, 38392 - 20000, 968)
## [1] 0.1606062
1 - phyper(13, 200, 38392 + 10000, 968)
## [1] 5.473547e-05
1 - phyper(13, 200, 38392 + 20000, 968)
## [1] 7.140953e-06