In this document, we will discuss the use of background regions. We first demonstrate it with a ChIP-seq TFBS dataset from UCSC table browser. Parameters are:
In the “Select dataset” section:
clade = Mammal
genome = Human
assembly = GRCh37/hg19
group = Regulation
track = ENCODE 3 TFBS
table: GM12878 MYB
And in the “Retrieve and display data” section:
output format = BED - browser extensible data
Then click the button “get output”.
We first read it as a GRanges
object.
library(rGREAT)
df = read.table("data/tb_encTfChipPkENCFF215YWS_GM12878_MYB_hg19.bed")
df = df[df[, 1] %in% paste0("chr", c(1:22, "X", "Y")), ]
gr = GRanges(seqnames = df[, 1], ranges = IRanges(df[, 2] + 1, df[, 3]))
The next two GREAT analysis uses the whole genome as background and excludes gap regions.
res1 = great(gr, "GO:BP", "hg19", exclude = NULL)
res2 = great(gr, "GO:BP", "hg19", exclude = "gap")
And we compare the significant GO terms:
tb1 = getEnrichmentTable(res1)
tb2 = getEnrichmentTable(res2)
library(eulerr)
lt = list(
genome = tb1$id[tb1$p_adjust < 0.001],
exclude_gap = tb2$id[tb2$p_adjust < 0.001]
)
plot(euler(lt), quantities = TRUE)
We can see, when excluding gap regions, there are fewer significant terms left. In GREAT analysis where Binomial test is applied, denote following variables: \(N\) as total number of input regions, \(n\) as number of input regions that fall into the extended TSS regions (denote the corresponding random variable as \(X\)), and \(p\) as the fraction of extended TSS regions in the genome, then \(X \sim B(p, N)\).
When gap regions are removed from the analysis, \(N\) and \(n\) are most likely unchanged because the input regions will not overlap to the gap regions, but \(p\) gets higher because the coverage of extended TSS regions is most likely unchanged, but the background size (genome subtracting gap regions) becomes smaller. When \(p\) gets higher while \(N\) and \(n\) are unchanged, the p-value from Binomial test also gets higher (when \(p\) gets higher, the probability of obtaining \(n\) regions from \(N\) with a success rate of \(p\) increases), which decreases the number of significant terms.
The null assumption of the Binomial test is that input regions are uniformaly distributed in the genome. Since gap regions are not sequenced and input regions will never overlap to them, using whole genome as background will not be proper and it under-estimates the fraction of a function associated regions in the genome, i.e. the value of \(p\), which results more smaller p-values and possible more false positives.
This might happen for other scenations, e.g. when dealing with methylation-related regions (e.g. differentially methylated regions, DMRs), choosing a background showing similar CpG density might be more proper, which can decrease the false positives caused by the regions that will never be called as DMRs (e.g. CpG poor regions).
This scenario might be less used, but it is a good example to show the idea. In great()
, arguments background
and exclude
can be set to a vector of chromosomes to keep or to remove certain chromosomes, then all analysis is restricted in the selected chromosomes. In the following example, we apply GREAT analysis only on one single chromosome at a time.
res_list = list()
for(chr in paste0("chr", 1:22)) {
res_list[[chr]] = great(gr, "GO:BP", "hg19", background = chr)
}
Next we compare the signigicant GO terms. In the followig plot, each row is a GO term, the first heatmap shows whether the GO term is significant (red) in the corresponding chromosome and the right square heatmap shows the similarities of GO terms. GO terms are clustered into several groups by clustering the similarity matrix. Summaries of GO terms in clusters are represented as word clouds and they are attached to the heatmap.
sig_list = lapply(res_list, function(x) {
tb = getEnrichmentTable(x)
tb$id[tb$p_adjust < 0.01]
})
library(simplifyEnrichment)
simplifyGOFromMultipleLists(sig_list)
From the plot we can see there are some degrees of specificities of the enrichment on different chromosomes, which are mainly caused by the unequal distribution of genes on chromosomes in gene sets.
It is more common to set background by a set of pre-defined genomic regions. In the following example, we use the chromatin states dataset from the same cell line as in the first example. Parameters for retrieving the data from UCSC table browser are:
clade = Mammal
genome = Human
assembly = GRCh37/hg19
group = Regulation
track = Broad ChromHMM
table: GM12878 ChromHMM
We read the chromatin states data and format it as a list where each element in the list corresponds to regions in one single chromatin state.
df = read.table("data/GM12878_chromHMM.bed")
df = df[df[, 1] %in% paste0("chr", c(1:22, "X", "Y")), ]
all_states = GRanges(seqnames = df[, 1], ranges = IRanges(df[, 2] + 1, df[, 3]),
state = gsub("^\\d+_", "", df[, 4]))
all_states = split(all_states, all_states$state)
names(all_states)
## [1] "Active_Promoter" "Heterochrom/lo" "Insulator" "Poised_Promoter"
## [5] "Repetitive/CNV" "Repressed" "Strong_Enhancer" "Txn_Elongation"
## [9] "Txn_Transition" "Weak_Enhancer" "Weak_Promoter" "Weak_Txn"
Each time, we set background to only one chromatin state:
res_list = lapply(all_states, function(bg) {
great(gr, "GO:BP", "hg19", background = bg)
})
We compare the numbers of significant GO terms:
sig_list = lapply(res_list, function(x) {
tb = getEnrichmentTable(x)
tb$id[tb$p_adjust < 0.05]
})
sapply(sig_list, length)
## Active_Promoter Heterochrom/lo Insulator Poised_Promoter Repetitive/CNV
## 41 332 1 0 0
## Repressed Strong_Enhancer Txn_Elongation Txn_Transition Weak_Enhancer
## 0 217 0 0 30
## Weak_Promoter Weak_Txn
## 20 9
It is interesting to see there are a lot of GO terms enriched when using heterchromatin as background. In the following parts of this document, we only look at three chromatin states: active promoter, strong enhancer and heterochromatin:
res_list = res_list[c("Active_Promoter", "Strong_Enhancer", "Heterochrom/lo")]
Next two plots demonstrate total widths of background regions in different states, and numbers of TFBS peaks falling in different background.
par(mar = c(4, 8, 4, 1), mfrow = c(1, 2), xpd = NA)
w = sapply(res_list, function(x) {
sum(width(x@background))
})
barplot(w, horiz = TRUE, las = 1, xlab = "Base pairs", xlim = c(1, 2e9),
main = "Sum of widths / background width")
n_gr = sapply(res_list, function(x) {
x@n_total
})
barplot(n_gr, horiz = TRUE, las = 1, xlab = "Number",
main = "Number of gr in background")
text(n_gr, 1:3, n_gr)
It is quite astonishing that there are only 62 TFBS peaks in heterochromatin regions, but the heterochromatin regions cover more than 90% of the genome. We check the top most significant GO terms:
tb = getEnrichmentTable(res_list[["Heterochrom/lo"]])
head(tb)
## id description genome_fraction
## 1 GO:0032940 secretion by cell 0.08394554
## 2 GO:1903530 regulation of secretion by cell 0.06109787
## 3 GO:0140352 export from cell 0.08904024
## 4 GO:0051046 regulation of secretion 0.06647850
## 5 GO:0046903 secretion 0.09617703
## 6 GO:1903532 positive regulation of secretion by cell 0.02817129
## observed_region_hits fold_enrichment p_value p_adjust mean_tss_dist
## 1 19 3.650600 4.403571e-07 0.0002212483 295062
## 2 16 4.223789 6.860412e-07 0.0002212483 290352
## 3 19 3.441721 1.078219e-06 0.0002318172 295062
## 4 16 3.881925 2.072632e-06 0.0003342118 290352
## 5 19 3.186328 3.403655e-06 0.0004390714 295062
## 6 10 5.725345 8.854797e-06 0.0009518906 346161
## observed_gene_hits gene_set_size fold_enrichment_hyper p_value_hyper
## 1 16 689 4.062599 1.617190e-06
## 2 13 481 4.728261 3.346546e-06
## 3 16 743 3.767336 4.276796e-06
## 4 13 535 4.251016 1.056093e-05
## 5 16 815 3.434516 1.368991e-05
## 6 7 246 4.978128 5.198301e-04
## p_adjust_hyper
## 1 0.0003476959
## 2 0.0005396305
## 3 0.0005517067
## 4 0.0007568663
## 5 0.0008829991
## 6 0.0052389126
Take the first term “GO:0032940” for example, we can actually see, although there are only 62 TFBS peaks, 20 of them are associated with “GO:0032940” and it is more than five times higher than the expected number: 62*0.0844 = 5.2.
However, the mean distance of “GO:0032940” associated TFBS peaks to TSS in the heterochormatin background is 290kb, and mean distance to TSS of all peaks in heterochromatin is around 350kb, which is very far from TSS. With larger distance to TSS, the more we need to be careful with the relaiblity of the associations.
par(mar = c(4, 8, 4, 1), xpd = NA)
dist_list = sapply(res_list, function(x) {
gra = getRegionGeneAssociations(x)
mean(abs(unlist(gra$dist_to_TSS)))
})
barplot(dist_list, horiz = TRUE, las = 1, xlab = "mean distance to TSS",
main = "mean distance to TSS")
Therefore, we will not consider the category of heterochromatin in the analysis because although it generates many significant terms statistically, whether they make biological sense needs to be questioned.
In the end, we only compare the significant GO terms from TFBS peaks by taking promoters and enhancers as backgrounds. We use a loose cutoff for adjusted p-values (0.1) because more GO terms will give a better clustering.
sig_lt2 = lapply(res_list[1:2], function(x) {
tb = getEnrichmentTable(x)
tb$id[tb$p_adjust < 0.1]
})
simplifyGOFromMultipleLists(sig_lt2)
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] grid stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] simplifyEnrichment_1.7.2 eulerr_6.1.1 rGREAT_1.99.7
## [4] GenomicRanges_1.48.0 GenomeInfoDb_1.32.2 IRanges_2.30.0
## [7] S4Vectors_0.34.0 BiocGenerics_0.42.0 knitr_1.39
## [10] colorout_1.2-2
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-3
## [2] rjson_0.2.21
## [3] ellipsis_0.3.2
## [4] circlize_0.4.16
## [5] XVector_0.36.0
## [6] GlobalOptions_0.1.2
## [7] clue_0.3-61
## [8] DT_0.23
## [9] bit64_4.0.5
## [10] AnnotationDbi_1.58.0
## [11] fansi_1.0.3
## [12] xml2_1.3.3
## [13] codetools_0.2-18
## [14] doParallel_1.0.17
## [15] cachem_1.0.6
## [16] GOSemSim_2.22.0
## [17] polyclip_1.10-0
## [18] jsonlite_1.8.0
## [19] Cairo_1.6-0
## [20] Rsamtools_2.12.0
## [21] cluster_2.1.3
## [22] GO.db_3.15.0
## [23] dbplyr_2.2.1
## [24] png_0.1-7
## [25] shiny_1.7.2
## [26] compiler_4.2.0
## [27] httr_1.4.3
## [28] assertthat_0.2.1
## [29] Matrix_1.4-1
## [30] fastmap_1.1.0
## [31] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [32] cli_3.3.0
## [33] later_1.3.0
## [34] htmltools_0.5.3
## [35] prettyunits_1.1.1
## [36] tools_4.2.0
## [37] NLP_0.2-1
## [38] glue_1.6.2
## [39] GenomeInfoDbData_1.2.8
## [40] dplyr_1.0.9
## [41] rappdirs_0.3.3
## [42] Rcpp_1.0.9
## [43] slam_0.1-50
## [44] TxDb.Hsapiens.UCSC.hg38.knownGene_3.15.0
## [45] Biobase_2.56.0
## [46] jquerylib_0.1.4
## [47] vctrs_0.4.1
## [48] Biostrings_2.64.0
## [49] rtracklayer_1.56.1
## [50] iterators_1.0.14
## [51] xfun_0.31
## [52] polylabelr_0.2.0
## [53] stringr_1.4.0
## [54] mime_0.12
## [55] lifecycle_1.0.1
## [56] restfulr_0.0.15
## [57] XML_3.99-0.10
## [58] org.Hs.eg.db_3.15.0
## [59] zlibbioc_1.42.0
## [60] hms_1.1.1
## [61] promises_1.2.0.1
## [62] MatrixGenerics_1.8.1
## [63] parallel_4.2.0
## [64] SummarizedExperiment_1.26.1
## [65] RColorBrewer_1.1-3
## [66] ComplexHeatmap_2.13.2
## [67] yaml_2.3.5
## [68] curl_4.3.2
## [69] memoise_2.0.1
## [70] sass_0.4.2
## [71] biomaRt_2.52.0
## [72] stringi_1.7.8
## [73] RSQLite_2.2.15
## [74] highr_0.9
## [75] BiocIO_1.6.0
## [76] foreach_1.5.2
## [77] GenomicFeatures_1.48.3
## [78] filelock_1.0.2
## [79] BiocParallel_1.30.3
## [80] shape_1.4.6
## [81] rlang_1.0.4
## [82] pkgconfig_2.0.3
## [83] matrixStats_0.62.0
## [84] bitops_1.0-7
## [85] evaluate_0.15
## [86] lattice_0.20-45
## [87] purrr_0.3.4
## [88] GenomicAlignments_1.32.1
## [89] htmlwidgets_1.5.4
## [90] bit_4.0.4
## [91] tidyselect_1.1.2
## [92] magrittr_2.0.3
## [93] R6_2.5.1
## [94] magick_2.7.3
## [95] generics_0.1.3
## [96] DelayedArray_0.22.0
## [97] DBI_1.1.3
## [98] pillar_1.8.0
## [99] proxyC_0.2.4
## [100] KEGGREST_1.36.3
## [101] RCurl_1.98-1.8
## [102] tibble_3.1.8
## [103] crayon_1.5.1
## [104] utf8_1.2.2
## [105] BiocFileCache_2.4.0
## [106] rmarkdown_2.14
## [107] GetoptLong_1.1.0
## [108] progress_1.2.2
## [109] blob_1.2.3
## [110] digest_0.6.29
## [111] tm_0.7-8
## [112] xtable_1.8-4
## [113] httpuv_1.6.5
## [114] RcppParallel_5.1.5
## [115] bslib_0.4.0