library(rGREAT)
Repo: https://github.com/jokergoo/jokergoo.github.io/tree/master/talks
We first read the regions (peaks of the transcription factor JUN) as a data frame.
= read.table("Encode3_JUN_hg19.bed") bed
We use standard Bioconductor data structure in local GREAT analysis, so we change bed
to a GRanges
object.
= GRanges(seqnames = bed[, 1], ranges = IRanges(bed[, 2], bed[, 3]))
gr gr
## GRanges object with 1726 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 936221-936347 *
## [2] chr1 1280252-1280492 *
## [3] chr1 2143913-2144153 *
## [4] chr1 4052186-4052426 *
## [5] chr1 4440698-4440938 *
## ... ... ... ...
## [1722] chr22 39096963-39097203 *
## [1723] chr22 39704855-39705095 *
## [1724] chr22 40894508-40894748 *
## [1725] chr22 43011250-43011490 *
## [1726] chr22 50364455-50364695 *
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
The function great()
is the core function to perform local GREAT analysis. You can either use built-in annotations or use self-provided annotations.
Using integrated gene sets
The GO gene sets
great()
supports GO gene sets for many organisms. The usage is:
= great(gr, "GO:BP", "hg19")
res_bp res_bp
## 1726 regions are associated to 2268 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: GO:BP
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
The GO gene sets are from the organism org.*.db packages. TSS definitions are mainly from TxDb packages. All supported organisms are:
:::BIOC_ANNO_PKGS rGREAT
## species_name species_latin taxon_id
## 1 human Homo sapiens 9606
## 2 human Homo sapiens 9606
## 3 human Homo sapiens 9606
## 4 human Homo sapiens 9606
## 5 mouse Mus musculus 10090
## 6 mouse Mus musculus 10090
## 7 mouse Mus musculus 10090
## 8 mouse Mus musculus 10090
## 9 rat Rattus norvegicus 10116
## 10 rat Rattus norvegicus 10116
## 11 rat Rattus norvegicus 10116
## 12 rat Rattus norvegicus 10116
## 13 chicken Gallus gallus 9031
## 14 chicken Gallus gallus 9031
## 15 chicken Gallus gallus 9031
## 16 rhesus Macaca mulatta 9544
## 17 rhesus Macaca mulatta 9544
## 18 rhesus Macaca mulatta 9544
## 19 worm Caenorhabditis elegans 6239
## 20 worm Caenorhabditis elegans 6239
## 21 dog Canis familiari 9615
## 22 dog Canis familiari 9615
## 23 dog Canis familiari 9615
## 24 pig Sus scrofa 9823
## 25 pig Sus scrofa 9823
## 26 yeast Saccharomyces cerevisiae 4932
## 27 yeast Saccharomyces cerevisiae 4932
## 28 chimp Pan troglodytes 9598
## 29 chimp Pan troglodytes 9598
## 30 chimp Pan troglodytes 9598
## 31 fruitfly Drosophila melanogaster 7227
## 32 fruitfly Drosophila melanogaster 7227
## 33 zebrafish Danio rerio 7955
## 34 zebrafish Danio rerio 7955
## 35 cattle Bos taurus 9913
## 36 cattle Bos taurus 9913
## 37 thale cress Arabidopsis thaliana 3702
## 38 thale cress Arabidopsis thaliana 3702
## 39 thale cress Arabidopsis thaliana 3702
## 40 thale cress Arabidopsis thaliana 3702
## txdb gene_id_in_txdb
## 1 TxDb.Hsapiens.UCSC.hg18.knownGene Entrez Gene ID
## 2 TxDb.Hsapiens.UCSC.hg19.knownGene Entrez Gene ID
## 3 TxDb.Hsapiens.UCSC.hg38.knownGene Entrez Gene ID
## 4 TxDb.Hsapiens.UCSC.hg38.refGene Entrez Gene ID
## 5 TxDb.Mmusculus.UCSC.mm10.knownGene Entrez Gene ID
## 6 TxDb.Mmusculus.UCSC.mm10.ensGene Ensembl gene ID
## 7 TxDb.Mmusculus.UCSC.mm39.refGene Entrez Gene ID
## 8 TxDb.Mmusculus.UCSC.mm9.knownGene Entrez Gene ID
## 9 TxDb.Rnorvegicus.UCSC.rn4.ensGene Ensembl gene ID
## 10 TxDb.Rnorvegicus.UCSC.rn5.refGene Entrez Gene ID
## 11 TxDb.Rnorvegicus.UCSC.rn6.refGene Entrez Gene ID
## 12 TxDb.Rnorvegicus.UCSC.rn7.refGene Entrez Gene ID
## 13 TxDb.Ggallus.UCSC.galGal4.refGene Entrez Gene ID
## 14 TxDb.Ggallus.UCSC.galGal5.refGene Entrez Gene ID
## 15 TxDb.Ggallus.UCSC.galGal6.refGene Entrez Gene ID
## 16 TxDb.Mmulatta.UCSC.rheMac10.refGene Entrez Gene ID
## 17 TxDb.Mmulatta.UCSC.rheMac3.refGene Entrez Gene ID
## 18 TxDb.Mmulatta.UCSC.rheMac8.refGene Entrez Gene ID
## 19 TxDb.Celegans.UCSC.ce11.refGene Entrez Gene ID
## 20 TxDb.Celegans.UCSC.ce11.ensGene Ensembl gene ID
## 21 TxDb.Cfamiliaris.UCSC.canFam3.refGene Entrez Gene ID
## 22 TxDb.Cfamiliaris.UCSC.canFam4.refGene Entrez Gene ID
## 23 TxDb.Cfamiliaris.UCSC.canFam5.refGene Entrez Gene ID
## 24 TxDb.Sscrofa.UCSC.susScr11.refGene Entrez Gene ID
## 25 TxDb.Sscrofa.UCSC.susScr3.refGene Entrez Gene ID
## 26 TxDb.Scerevisiae.UCSC.sacCer2.sgdGene SGD Gene ID
## 27 TxDb.Scerevisiae.UCSC.sacCer3.sgdGene SGD Gene ID
## 28 TxDb.Ptroglodytes.UCSC.panTro4.refGene Entrez Gene ID
## 29 TxDb.Ptroglodytes.UCSC.panTro5.refGene Entrez Gene ID
## 30 TxDb.Ptroglodytes.UCSC.panTro6.refGene Entrez Gene ID
## 31 TxDb.Dmelanogaster.UCSC.dm3.ensGene Ensembl gene ID
## 32 TxDb.Dmelanogaster.UCSC.dm6.ensGene Ensembl gene ID
## 33 TxDb.Drerio.UCSC.danRer10.refGene Entrez Gene ID
## 34 TxDb.Drerio.UCSC.danRer11.refGene Entrez Gene ID
## 35 TxDb.Btaurus.UCSC.bosTau8.refGene Entrez Gene ID
## 36 TxDb.Btaurus.UCSC.bosTau9.refGene Entrez Gene ID
## 37 TxDb.Athaliana.BioMart.plantsmart51 TAIR ID
## 38 TxDb.Athaliana.BioMart.plantsmart22 TAIR ID
## 39 TxDb.Athaliana.BioMart.plantsmart25 TAIR ID
## 40 TxDb.Athaliana.BioMart.plantsmart28 TAIR ID
## genome_version_in_txdb orgdb primary_gene_id_in_orgdb
## 1 hg18 org.Hs.eg.db Entrez Gene ID
## 2 hg19 org.Hs.eg.db Entrez Gene ID
## 3 hg38 org.Hs.eg.db Entrez Gene ID
## 4 hg38 org.Hs.eg.db Entrez Gene ID
## 5 mm10 org.Mm.eg.db Entrez Gene ID
## 6 mm10 org.Mm.eg.db Entrez Gene ID
## 7 mm39 org.Mm.eg.db Entrez Gene ID
## 8 mm9 org.Mm.eg.db Entrez Gene ID
## 9 rn4 org.Rn.eg.db Entrez Gene ID
## 10 rn5 org.Rn.eg.db Entrez Gene ID
## 11 rn6 org.Rn.eg.db Entrez Gene ID
## 12 rn7 org.Rn.eg.db Entrez Gene ID
## 13 galGal4 org.Gg.eg.db Entrez Gene ID
## 14 galGal5 org.Gg.eg.db Entrez Gene ID
## 15 galGal6 org.Gg.eg.db Entrez Gene ID
## 16 rheMac10 org.Mmu.eg.db Entrez Gene ID
## 17 rheMac3 org.Mmu.eg.db Entrez Gene ID
## 18 rheMac8 org.Mmu.eg.db Entrez Gene ID
## 19 ce11 org.Ce.eg.db Entrez Gene ID
## 20 ce11 org.Ce.eg.db Entrez Gene ID
## 21 canFam3 org.Cf.eg.db Entrez Gene ID
## 22 canFam4 org.Cf.eg.db Entrez Gene ID
## 23 canFam5 org.Cf.eg.db Entrez Gene ID
## 24 susScr11 org.Ss.eg.db Entrez Gene ID
## 25 susScr3 org.Ss.eg.db Entrez Gene ID
## 26 sacCer2 org.Sc.sgd.db SGD Gene ID
## 27 sacCer3 org.Sc.sgd.db SGD Gene ID
## 28 panTro4 org.Pt.eg.db Entrez Gene ID
## 29 panTro5 org.Pt.eg.db Entrez Gene ID
## 30 panTro6 org.Pt.eg.db Entrez Gene ID
## 31 dm3 org.Dm.eg.db Entrez Gene ID
## 32 dm6 org.Dm.eg.db Entrez Gene ID
## 33 danRer10 org.Dr.eg.db Entrez Gene ID
## 34 danRer11 org.Dr.eg.db Entrez Gene ID
## 35 bosTau8 org.Bt.eg.db Entrez Gene ID
## 36 bosTau9 org.Bt.eg.db Entrez Gene ID
## 37 araTha org.At.tair.db TAIR ID
## 38 araTha org.At.tair.db TAIR ID
## 39 araTha org.At.tair.db TAIR ID
## 40 araTha org.At.tair.db TAIR ID
## chromosome_prefix
## 1 chr
## 2 chr
## 3 chr
## 4 chr
## 5 chr
## 6 chr
## 7 chr
## 8 chr
## 9 chr
## 10 chr
## 11 chr
## 12 chr
## 13 chr
## 14 chr
## 15 chr
## 16 chr
## 17 chr
## 18 chr
## 19 chr
## 20 chr
## 21 chr
## 22 chr
## 23 chr
## 24 chr
## 25 chr
## 26 chr
## 27 chr
## 28 chr
## 29 chr
## 30 chr
## 31 chr
## 32 chr
## 33 chr
## 34 chr
## 35 chr
## 36 chr
## 37
## 38
## 39
## 40
MSigDB gene sets
The second category of gene sets are from MSigDB. Note this is only for human.
= great(gr, "msigdb:h", "hg19")
res_h res_h
## 1726 regions are associated to 2268 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: msigdb:h
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
All supported MSigDB gene sets are:
"msigdb:H"
Hallmark gene sets."msigdb:C1"
Positional gene sets."msigdb:C2"
Curated gene sets."msigdb:C2:CGP"
C2 subcategory: chemical and genetic perturbations gene sets."msigdb:C2:CP"
C2 subcategory: canonical pathways gene sets."msigdb:C2:CP:KEGG"
C2 subcategory: KEGG subset of CP."msigdb:C2:CP:PID"
C2 subcategory: PID subset of CP."msigdb:C2:CP:REACTOME"
C2 subcategory: REACTOME subset of CP."msigdb:C2:CP:WIKIPATHWAYS"
C2 subcategory: WIKIPATHWAYS subset of CP."msigdb:C3"
Regulatory target gene sets."msigdb:C3:MIR:MIRDB"
miRDB of microRNA targets gene sets."msigdb:C3:MIR:MIR_LEGACY"
MIR_Legacy of MIRDB."msigdb:C3:TFT:GTRD"
GTRD transcription factor targets gene sets."msigdb:C3:TFT:TFT_LEGACY"
TFT_Legacy."msigdb:C4"
Computational gene sets."msigdb:C4:CGN"
C4 subcategory: cancer gene neighborhoods gene sets."msigdb:C4:CM"
C4 subcategory: cancer modules gene sets."msigdb:C5"
Ontology gene sets."msigdb:C5:GO:BP"
C5 subcategory: BP subset."msigdb:C5:GO:CC"
C5 subcategory: CC subset."msigdb:C5:GO:MF"
C5 subcategory: MF subset."msigdb:C5:HPO"
C5 subcategory: human phenotype ontology gene sets."msigdb:C6"
Oncogenic signature gene sets."msigdb:C7"
Immunologic signature gene sets."msigdb:C7:IMMUNESIGDB"
ImmuneSigDB subset of C7."msigdb:C7:VAX"
C7 subcategory: vaccine response gene sets."msigdb:C8"
Cell type signature gene sets.
The prefix msigdb:
can be omitted when specified in great()
and the name of a MSigDB can be used as case insensitive.
great(gr, "h", "hg19")
great(gr, "c2:cp", "hg19")
Manually set gene sets
More gene sets can be set as a list of gene vectors. Note genes in the gene sets must be mostly in Entrez ID type. In the following example, we use a gene set collection from DSigDB. This collection (FDA approved kinase inhibitors) only contains 28 gene sets.
Here read_gmt()
is a simple function that reads a gmt file as a list of vectors, also performs gene ID conversion.
= read_gmt(url("http://dsigdb.tanlab.org/Downloads/D2_LINCS.gmt"),
gs from = "SYMBOL", to = "ENTREZ", orgdb = "org.Hs.eg.db")
1:2] gs[
## $GSK429286A
## [1] "5566" "9475" "81788" "6093" "5562" "26524" "5592" "5979" "640"
## [10] "5567" "3717" "5613" "23012" "695" "3718"
##
## $`BS-181`
## [1] "122011" "1452" "1022" "7272" "1454" "1453" "65975" "147746"
## [9] "1859" "1195" "1196" "9149" "57396" "5261"
great(gr, gs, "hg19")
## 1726 regions are associated to 2268 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Or using KEGG gene sets:
= KEGGREST::keggLink("hsa", "pathway")
kegg = split(kegg, names(kegg))
kegg_gs = lapply(kegg_gs, function(x) unname(gsub("hsa:", "", x)))
kegg_gs names(kegg_gs) = gsub("path:", "", names(kegg_gs))
1:2] kegg_gs[
## $hsa00010
## [1] "10327" "124" "125" "126" "127" "128" "130" "130589"
## [9] "131" "160287" "1737" "1738" "2023" "2026" "2027" "217"
## [17] "218" "219" "2203" "221" "222" "223" "224" "226"
## [25] "229" "230" "2538" "2597" "26330" "2645" "2821" "3098"
## [33] "3099" "3101" "387712" "3939" "3945" "3948" "441531" "501"
## [41] "5105" "5106" "5160" "5161" "5162" "5211" "5213" "5214"
## [49] "5223" "5224" "5230" "5232" "5236" "5313" "5315" "55276"
## [57] "55902" "57818" "669" "7167" "80201" "83440" "84532" "8789"
## [65] "92483" "92579" "9562"
##
## $hsa00020
## [1] "1431" "1737" "1738" "1743" "2271" "3417" "3418" "3419" "3420"
## [10] "3421" "4190" "4191" "47" "48" "4967" "50" "5091" "5105"
## [19] "5106" "5160" "5161" "5162" "55753" "6389" "6390" "6391" "6392"
## [28] "8801" "8802" "8803"
great(gr, kegg_gs, "hg19")
## 1726 regions are associated to 2268 genes' extended TSSs.
## TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
## Genome: hg19
## OrgDb: org.Hs.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Manually set gene sets and TSS
Mainly for non-model/less well studied organisms.
Two inputs:
- Gene sets: A list of gene vectors.
- TSS/gene: A
GRanges
object with agene_id
column. Gene IDs ingene_id
column must match to genes in the gene sets.
Extending TSS needs the chromosome lengths.
- with
gene
as input without additional chromosome length information
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
= genes(TxDb.Hsapiens.UCSC.hg19.knownGene) gene
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
= gene[seqnames(gene) %in% paste0("chr", c(1:22, "X", "Y"))]
gene = sort(gene)
gene # remove seqlength informatio
= GRanges(seqnames = as.vector(seqnames(gene)), ranges = ranges(gene),
gene strand = strand(gene), gene_id = gene$gene_id)
great(gr, gs, gene)
## ! You havn't specified neither `seqlengths` nor `genome`, take the
## maximal ends of genes on chromosomes as the chromosome lengths.
## * TSS extension mode is 'basalPlusExt'.
## * construct the basal domains by extending 5000bp to upstream and 1000bp to downsteram of TSS.
## * calculate distances to neighbour regions.
## * extend to both sides until reaching the neighbour genes or to the maximal extension.
## * check gene ID type in `gene_sets` and in `extended_tss`.
## * use whole genome as background.
## * remove excluded regions from background.
## * overlap `gr` to background regions (based on midpoint).
## * in total 1726 `gr`.
## * overlap extended TSS to background regions.
## * check which genes are in the gene sets.
## * only take gene sets with size >= 5.
## * in total 84 gene sets.
## * overlap `gr` to every extended TSS.
## * perform binomial test for each biological term.
## 1726 regions are associated to 2395 genes' extended TSSs.
## TSS source: self-provided
## Genome: unknown
## Gene sets: self-provided
## Background: whole genome
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
In this case, the end position of the last gene on each chromosome is treated as the end of the chromosome.
- with
gene
whereseqlengths()
is already set:
seqlengths(gene) = c(chr1 = 249250621, chr2 = 243199373, chr3 = 198022430, chr4 = 191154276,
chr5 = 180915260, chr6 = 171115067, chr7 = 159138663, chr8 = 146364022,
chr9 = 141213431, chr10 = 135534747, chr11 = 135006516, chr12 = 133851895,
chr13 = 115169878, chr14 = 107349540, chr15 = 102531392, chr16 = 90354753,
chr17 = 81195210, chr18 = 78077248, chr19 = 59128983, chr20 = 63025520,
chr21 = 48129895, chr22 = 51304566, chrX = 155270560, chrY = 59373566)
great(gr, gs, gene)
## 1726 regions are associated to 2395 genes' extended TSSs.
## TSS source: self-provided
## Genome: unknown
## Gene sets: self-provided
## Background: whole genome
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Set background regions
There are two ways to set “background/mask” regions: exclude
and background
.
It is very common that gap regions in the genome are removed from the analysis. In the following example, getGapFromUCSC()
can be used to retrieve gap regions from UCSC table browser.
= getGapFromUCSC("hg19", paste0("chr", c(1:22, "X", "Y")))
gap great(gr, "MSigDB:H", "hg19", exclude = gap)
Or only in a set of pre-defined background regions, e.g. regionts with similar CpG densities as input regions.
= ... # from upstream analysis, a GRanges object
background_regions great(gr, "MSigDB:H", "hg19", background = background_regions)
Note as the same as online GREAT, rGREAT by default excludes gap regions from the analysis.
Alternatively, background
and exclude
can also be set to a vector of chromosome names, then the whole selected chromosomes will be included/excluded from the analysis.
great(gr, "GO:BP", "hg19", background = paste0("chr", 1:22))
great(gr, "GO:BP", "hg19", exclude = c("chrX", "chrY"))
Get enrichment table
Simply use getEnrichmentTable()
function.
= getEnrichmentTable(res_bp)
tb head(tb)
## id description genome_fraction
## 1 GO:0097190 apoptotic signaling pathway 0.05113446
## 2 GO:0042981 regulation of apoptotic process 0.13585729
## 3 GO:0043067 regulation of programmed cell death 0.13812704
## 4 GO:0010941 regulation of cell death 0.15338546
## 5 GO:0006915 apoptotic process 0.17420944
## 6 GO:0012501 programmed cell death 0.17749351
## observed_region_hits fold_enrichment p_value p_adjust mean_tss_dist
## 1 176 1.994152 0 0 110109
## 2 384 1.637599 0 0 125275
## 3 385 1.614884 0 0 126229
## 4 413 1.560002 0 0 123382
## 5 462 1.536489 0 0 123653
## 6 468 1.527646 0 0 124195
## observed_gene_hits gene_set_size fold_enrichment_hyper p_value_hyper
## 1 110 579 1.554960 1.289313e-06
## 2 271 1393 1.592294 5.551115e-16
## 3 273 1421 1.572439 2.220446e-15
## 4 295 1571 1.536919 3.108624e-15
## 5 334 1824 1.498742 1.332268e-15
## 6 341 1876 1.487739 1.998401e-15
## p_adjust_hyper
## 1 2.898304e-05
## 2 1.366407e-13
## 3 4.204329e-13
## 4 5.668059e-13
## 5 2.981252e-13
## 6 3.935252e-13
There are also columns for hypergeometric test on the numbers of genes, the same as in the original GREAT method.
There is a new column “mean_tss_dist” in the result table which is the mean absolute distance of input regions to TSS of genes in a gene set. Please note with larger distance to TSS, the more we need to be careful with the reliability of the associations between input regions to genes.
Make volcano plot
In differential gene expression analysis, volcano plot is used to visualize relations between log2 fold change and (adjusted) p-values. Similarly, we can also use volcano plot to visualize relations between fold enrichment and (adjusted) p-values for the enrichment analysis. The plot is made by the function plotVolcano()
:
plotVolcano(res_bp)
As the enrichment analysis basically only looks for over-representations, it is actually half volcano.
Get region-gene associations
plotRegionGeneAssociations()
generates three plots similar as those by online GREAT. getRegionGeneAssociations()
returns a GRanges
object containing associations between regions and genes.
plotRegionGeneAssociations(res_bp)
getRegionGeneAssociations(res_bp)
## GRanges object with 1717 ranges and 2 metadata columns:
## seqnames ranges strand | annotated_genes dist_to_TSS
## <Rle> <IRanges> <Rle> | <CharacterList> <IntegerList>
## [1] chr1 936221-936347 * | HES4 -669
## [2] chr1 1280252-1280492 * | TAS1R3,DVL1 13526,4000
## [3] chr1 2143913-2144153 * | FAAP20 -4741
## [4] chr1 4052186-4052426 * | C1orf174,AJAP1 -235329,-662679
## [5] chr1 4440698-4440938 * | C1orf174,AJAP1 -623841,-274167
## ... ... ... ... . ... ...
## [1713] chr22 39096963-39097203 * | JOSD1,GTPBP1 -504,-4604
## [1714] chr22 39704855-39705095 * | PDGFB,RPL3 -63898,11296
## [1715] chr22 40894508-40894748 * | SGSM3,MRTFA 127913,137942
## [1716] chr22 43011250-43011490 * | POLDIP3 -288
## [1717] chr22 50364455-50364695 * | PIM3,IL17REL 10312,86360
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
Being different from the online GREAT, getRegionGeneAssociations()
on the local GREAT object calculates the distance to TSS based on the borders of the input regions. The argument by_middle_points
can be set to TRUE
to let the distance be based on the middle points of input regions.
Please note the two meta columns are in formats of CharacterList
and IntegerList
because a region may associate to multiple genes.
Of course you can set a specific geneset term by argument term_id
.
plotRegionGeneAssociations(res_bp, term_id = "GO:0006915")
getRegionGeneAssociations(res_bp, term_id = "GO:0006915")
## GRanges object with 463 ranges and 2 metadata columns:
## seqnames ranges strand | annotated_genes dist_to_TSS
## <Rle> <IRanges> <Rle> | <CharacterList> <IntegerList>
## [1] chr1 10440739-10440979 * | KIF1B 169975
## [2] chr1 16472855-16473095 * | EPHA2 9487
## [3] chr1 16490874-16491114 * | EPHA2 -8292
## [4] chr1 23073949-23074189 * | KDM1A -271752
## [5] chr1 23912118-23912358 * | ID3,RPL11 -25833,-105911
## ... ... ... ... . ... ...
## [459] chr20 62330537-62330777 * | TNFRSF6B 3406
## [460] chr21 44897828-44898068 * | SIK1,RRP1B -50826,-181364
## [461] chr22 21258958-21259198 * | CRKL -12516
## [462] chr22 37921455-37921695 * | CARD10 -6077
## [463] chr22 50364455-50364695 * | PIM3 10312
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
With local GREAT, it is possible to obtain associations between genes and all regions.
getRegionGeneAssociations(res_bp)
## GRanges object with 1717 ranges and 2 metadata columns:
## seqnames ranges strand | annotated_genes dist_to_TSS
## <Rle> <IRanges> <Rle> | <CharacterList> <IntegerList>
## [1] chr1 936221-936347 * | HES4 -669
## [2] chr1 1280252-1280492 * | TAS1R3,DVL1 13526,4000
## [3] chr1 2143913-2144153 * | FAAP20 -4741
## [4] chr1 4052186-4052426 * | C1orf174,AJAP1 -235329,-662679
## [5] chr1 4440698-4440938 * | C1orf174,AJAP1 -623841,-274167
## ... ... ... ... . ... ...
## [1713] chr22 39096963-39097203 * | JOSD1,GTPBP1 -504,-4604
## [1714] chr22 39704855-39705095 * | PDGFB,RPL3 -63898,11296
## [1715] chr22 40894508-40894748 * | SGSM3,MRTFA 127913,137942
## [1716] chr22 43011250-43011490 * | POLDIP3 -288
## [1717] chr22 50364455-50364695 * | PIM3,IL17REL 10312,86360
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
The Shiny application
shinyReport()
creates a Shiny application to view the complete results:
shinyReport(res_bp)
Work with other organisms
Use BioMart genes and gene sets
rGREAT also supports GO gene sets for a huge number of organisms retrieved from Ensembl BioMart. A specific organism can be set with the biomart_dataset
argument:
# Giant panda
= randomRegionsFromBioMartGenome("amelanoleuca_gene_ensembl")
gr_panda gr_panda
## GRanges object with 993 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 9 316703-326582 *
## [2] 9 4025898-4029990 *
## [3] 9 7512107-7513962 *
## [4] 9 8093520-8102042 *
## [5] 9 8840932-8850802 *
## ... ... ... ...
## [989] 19 30864977-30865979 *
## [990] 19 32737775-32747467 *
## [991] 19 33398878-33402107 *
## [992] LNAT02007500.1 392610-399562 *
## [993] LNAT02006054.1 1549328-1558943 *
## -------
## seqinfo: 2040 sequences from an unspecified genome; no seqlengths
great(gr_panda, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")
## 960 regions are associated to 1491 genes' extended TSSs.
## TSS source: amelanoleuca_gene_ensembl
## Genome: Giant panda genes (ASM200744v2)
## Gene sets: GO:BP
## Background: whole genome
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Note both TSS and gene sets are from BioMart. The value for gene sets (the second argument) can only be one of "GO:BP"
, "GO:CC"
and "GO:MF"
.
rGREAT now supports more than 600 organisms. The complete list can be found with BioMartGOGeneSets::supportedOrganisms()
. Be careful with the genome version and format of chromsome names.
For non-model organisms, the default chromosomes may include many small contigs. You can set the background
argument to a vector of chromosomes that you want to include.
Use MSigDB gene sets
MSigDB contains gene sets only for human, but it can be extended to other organisms by mapping to the homologues genes. The package msigdbr has already mapped genes to many other organisms. A full list of supported organisms in msigdbr can be obtained by:
library(msigdbr)
msigdbr_species()
## # A tibble: 20 × 2
## species_name species_common_name
## <chr> <chr>
## 1 Anolis carolinensis Carolina anole, green anole
## 2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
## 3 Caenorhabditis elegans <NA>
## 4 Canis lupus familiaris dog, dogs
## 5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
## 6 Drosophila melanogaster fruit fly
## 7 Equus caballus domestic horse, equine, horse
## 8 Felis catus cat, cats, domestic cat
## 9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
## 10 Homo sapiens human
## 11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica gray short-tailed opossum
## 13 Mus musculus house mouse, mouse
## 14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes chimpanzee
## 16 Rattus norvegicus brown rat, Norway rat, rat, rats
## 17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
## 18 Schizosaccharomyces pombe 972h- <NA>
## 19 Sus scrofa pig, pigs, swine, wild boar
## 20 Xenopus tropicalis tropical clawed frog, western clawed frog
To obtain gene sets for non-human organisms, e.g.:
= msigdbr(species = "chimpanzee", category = "H")
h_gene_sets head(h_gene_sets)
## # A tibble: 6 × 18
## gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 H "" HALLMARK_ADIPOGENESIS ABCA1 464630 ENSPTRG0000002…
## 2 H "" HALLMARK_ADIPOGENESIS ABCB8 463892 ENSPTRG0000001…
## 3 H "" HALLMARK_ADIPOGENESIS ACAA2 455414 ENSPTRG0000001…
## 4 H "" HALLMARK_ADIPOGENESIS ACADL 459914 ENSPTRG0000001…
## 5 H "" HALLMARK_ADIPOGENESIS ACADM 469356 ENSPTRG0000000…
## 6 H "" HALLMARK_ADIPOGENESIS ACADS 742921 ENSPTRG0000000…
## # ℹ 12 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## # human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## # gs_exact_source <chr>, gs_url <chr>, gs_description <chr>, taxon_id <int>,
## # ortholog_sources <chr>, num_ortholog_sources <dbl>
If the organism you selected has a corresponding TxDb package available which provides TSS information, you need to make sure the gene sets use Entrez gene ID as the identifier (Most TxDb packages use Entrez ID as primary ID, you can check the variable rGREAT:::BIOC_ANNO_PKGS
).
# convert to a list of gene sets
= split(h_gene_sets$entrez_gene, h_gene_sets$gs_name)
h_gene_sets = lapply(h_gene_sets, as.character) # just to make sure gene IDs are all in character.
h_gene_sets 1:2] h_gene_sets[
## $HALLMARK_ADIPOGENESIS
## [1] "464630" "463892" "455414" "459914" "469356" "742921"
## [7] "454672" "104003784" "454895" "451866" "737339" "471032"
## [13] "451742" "737305" "100615914" "456723" "107967644" "454362"
## [19] "464334" "743667" "741867" "449586" "100614256" "741708"
## [25] "459164" "746692" "473976" "452433" "468889" "745443"
## [31] "460926" "455644" "451116" "454684" "744890" "461229"
## [37] "740513" "104005232" "463949" "469319" "748673" "450673"
## [43] "468605" "471455" "456837" "464611" "452659" "472079"
## [49] "452307" "454118" "100616508" "465727" "742828" "737945"
## [55] "107976794" "107976794" "746229" "472893" "456557" "457056"
## [61] "747265" "736777" "464460" "451393" "745691" "454512"
## [67] "466780" "463861" "744984" "452566" "457117" NA
## [73] "747936" "459360" "461436" "464353" "464074" "466651"
## [79] "451984" "456243" "464255" "467738" "466732" "461244"
## [85] "456929" "460520" "450562" "450738" "464140" "459670"
## [91] "452976" "471703" "741876" "471135" "461424" "459828"
## [97] "452295" "460113" "453565" "741179" "747276" "470423"
## [103] "451967" "450290" "473975" "473975" "460157" "462946"
## [109] "449638" "738797" "456076" "451807" "464031" "739986"
## [115] "459173" "460872" "463484" "462853" "739167" "457477"
## [121] "742027" "746245" "472764" "747387" "744096" "101057233"
## [127] "744811" "463686" "744435" "468748" "451175" "460227"
## [133] "454744" "739996" NA "450735" "454478" "457929"
## [139] "738397" "458602" "456908" "451591" "450310" "107970333"
## [145] "465012" "463481" "463481" "460178" "470365" "742092"
## [151] "741184" "459094" "459374" "456940" "745779" "454531"
## [157] "737918" "107973114" "742100" "470420" "468499" "467657"
## [163] "100608935" "462416" "451281" "470281" "470281" "452359"
## [169] "456862" "456526" "747462" "474051" "456155" "458647"
## [175] "744390" "455841" "459096" "459031" "450574" "449637"
## [181] "450628" "470477" "471247" "453405" "739128" "454681"
## [187] "464707" "470417" "450933" "459685" "460443" "468406"
## [193] "458803" "467151" "464550" "745004" "451416" "735808"
## [199] "743144" "460348" "107974864" "471631" "741897" "463489"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "454210" "461523" "450363" "100609296" "459646" "740898"
## [7] "466415" "450170" "465345" "456984" "744209" "449497"
## [13] "100608992" "459361" "741390" "468208" "748142" "473220"
## [19] "748205" "736543" "454593" "747004" "454579" "747123"
## [25] "462689" "460323" "740071" "450128" "469524" "449512"
## [31] "748272" "470617" "451584" "742330" "451585" "450124"
## [37] "100615583" "473802" "460569" "745293" "462191" "740560"
## [43] "470892" "470900" "735755" "470426" "460577" "465021"
## [49] "465607" "736196" "457127" "453745" "457277" "738275"
## [55] "471200" NA "459634" "457770" "469142" "463415"
## [61] "466216" "458797" "453714" "469204" "750603" NA
## [67] "100615835" "740028" "451695" "451158" "471510" "738331"
## [73] "469584" "739516" "465940" "461906" "468521" "457003"
## [79] "472959" "457020" "467610" "461873" "452825" "460623"
## [85] "463280" "746195" "750725" "100608816" "449592" "471979"
## [91] "471977" "471974" "462591" "462540" "494187" "450196"
## [97] "474132" "473965" "449517" "747276" "470077" "743102"
## [103] "472749" "469657" "736204" "460816" "471723" "455851"
## [109] "449564" "737808" "449644" "739011" "744277" "450200"
## [115] "461472" "456370" "450884" "470203" "101059843" "449565"
## [121] "454005" "463288" "464245" "745517" "463371" "470524"
## [127] "462386" "450927" "454294" "454045" "458607" "735556"
## [133] "464979" "450156" "738375" "456715" "471734" "736309"
## [139] "744486" "459682" "745667" "472771" "462888" "748652"
## [145] "449582" "458294" "460699" "459239" "741196" "460720"
## [151] "100322885" "469743" "455026" "740704" "740477" "450512"
## [157] "453993" "456276" "743176" "748032" "457607" "462249"
## [163] "464277" "737451" "746600" "737526" "456065" "461536"
## [169] "107966305" "746721" "737070" "459209" "451169" "450503"
## [175] "461971" "461023" "459834" "100610925" "471978" "746399"
## [181] "746814" "456060" "457742" "451611" "107971092" "461167"
## [187] "471325" "471374" "471167" "494186" "748737" "464876"
## [193] "741922" "745141" "452125" "453161" "743187" "459427"
Now we can perform the local GREAT analysis.
= randomRegions(genome = "panTro6")
gr great(gr, h_gene_sets, "panTro6")
## 990 regions are associated to 659 genes' extended TSSs.
## TSS source: TxDb.Ptroglodytes.UCSC.panTro6.refGene
## Genome: panTro6
## OrgDb: org.Pt.eg.db
## Gene sets: self-provided
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Use NCBI genomes
NCBI provides genome data for a huge number of organisms. It is quite easy to obtain the GO gene sets for a specific organism with the AnnotationHub package where NCBI is also the data source.
Taking dolphin (Tursiops truncatus) as an example.
library(AnnotationHub)
= AnnotationHub() ah
display(ah)
# id for the AnnotationHub dataset
= ah[["AH108106"]]
org_db org_db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Tursiops truncatus
## | SPECIES: Tursiops truncatus
## | CENTRALID: GID
## | Taxonomy ID: 9739
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
= mapIds(org_db, keytype = "ONTOLOGY", keys = "BP",
all_bp_ids column = "GO", multiVals = "list")[[1]]
head(all_bp_ids)
## [1] "GO:0006119" "GO:0015986" "GO:0042776" "GO:0019646" "GO:0008535"
## [6] "GO:0006120"
We select "GOALL"
so that gene set for each GO term includes genes from all its offspring terms.
= mapIds(org_db, keytype = "GOALL", keys = all_bp_ids,
bp_genesets column = "ENTREZID", multiVals = "list")
1:2] bp_genesets[
## $`GO:0006119`
## [1] "7412047" "7412050" "7412051" "7412052" "7412053" "7412054"
## [7] "7412055" "7412056" "7412058" "7412059" "101315513" "101316029"
## [13] "101316487" "101316542" "101316570" "101317856" "101318351" "101318829"
## [19] "101318867" "101319016" "101319704" "101320192" "101320707" "101322629"
## [25] "101323037" "101323468" "101325919" "101326733" "101326830" "101327630"
## [31] "101329052" "101331837" "101332916" "101333806" "101334086" "101334550"
## [37] "101335214" "101335403" "101336771" "101337012" "101337690" "101337707"
## [43] "101339027" "101339993" "109547368" "109549348" "109549675"
##
## $`GO:0015986`
## [1] "7412049" "7412050" "101315513" "101316487" "101318351" "101319550"
## [7] "101320149" "101321264" "101326149" "101331575" "101332310" "101334550"
## [13] "101336294"
Unfortunately, there is no processed data of genome and genes for many other organisms, but these data is available from NCBI. The function getGenomeDataFromNCBI()
accepts a RefSeq accession number for a given genome assembly and returns information of the genome as well as the genes.
The RefSeq accession number can be obtained from https://www.ncbi.nlm.nih.gov/data-hub/genome/. The accession number for dolphin is “GCF_011762595.1” and its genome page is at https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_011762595.1/ (always be careful with the genome/assembly version). We can get its genes with the command:
= getGenomeDataFromNCBI("GCF_011762595.1", return_granges = TRUE)
genes genes
## GRanges object with 19081 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] 1 234018-239154 + | 101335869
## [2] 1 241768-255021 + | 101336161
## [3] 1 259970-287776 + | 101336448
## [4] 1 288745-298452 + | 101336738
## [5] 1 325391-327082 - | 101321052
## ... ... ... ... . ...
## [19077] MT 9921-10217 + | 7412053
## [19078] MT 10211-11588 + | 7412054
## [19079] MT 11790-13610 + | 7412055
## [19080] MT 13594-14121 - | 7412056
## [19081] MT 14195-15334 + | 7412057
## -------
## seqinfo: 23 sequences from an unspecified genome
The good thing is dolphin genome is already assembled on the chromosome level, so the chromosome length information is already included in genes
:
seqlengths(genes)
## 1 2 3 4 5 6 7 8
## 183742880 176472748 171421769 144183200 137487549 115599616 114015307 108430135
## 9 10 11 12 13 14 15 16
## 105335178 102812509 102167733 89330967 88485069 88483154 86293990 83508544
## 17 18 19 20 21 X MT
## 78442772 78151963 58636127 58467291 35682248 128693445 16388
And we can directly use genes
in great()
.
= randomRegions(seqlengths = seqlengths(genes))
gr gr
## GRanges object with 1000 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] 1 5026741-5030722 *
## [2] 1 5244108-5245358 *
## [3] 1 8202579-8204950 *
## [4] 1 9039213-9047046 *
## [5] 1 10218811-10220154 *
## ... ... ... ...
## [996] X 120273682-120276320 *
## [997] X 124498419-124508365 *
## [998] X 125121496-125125278 *
## [999] X 125555571-125562044 *
## [1000] X 127437523-127441364 *
## -------
## seqinfo: 23 sequences from an unspecified genome; no seqlengths
great(gr, bp_genesets, genes)
## 1000 regions are associated to 1574 genes' extended TSSs.
## TSS source: self-provided
## Genome: unknown
## Gene sets: self-provided
## Background: whole genome
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
Session info
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur/Monterey 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] AnnotationHub_3.6.0
## [2] BiocFileCache_2.6.1
## [3] dbplyr_2.3.3
## [4] msigdbr_7.5.1
## [5] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
## [6] GenomicFeatures_1.50.4
## [7] AnnotationDbi_1.60.2
## [8] Biobase_2.58.0
## [9] rGREAT_2.3.1
## [10] GenomicRanges_1.50.2
## [11] GenomeInfoDb_1.34.9
## [12] IRanges_2.32.0
## [13] S4Vectors_0.36.2
## [14] BiocGenerics_0.44.0
## [15] knitr_1.43
## [16] rmarkdown_2.23
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7
## [2] matrixStats_1.0.0
## [3] bit64_4.0.5
## [4] filelock_1.0.2
## [5] doParallel_1.0.17
## [6] RColorBrewer_1.1-3
## [7] progress_1.2.2
## [8] httr_1.4.6
## [9] tools_4.2.0
## [10] bslib_0.5.0
## [11] utf8_1.2.3
## [12] R6_2.5.1
## [13] DT_0.28
## [14] DBI_1.1.3
## [15] colorspace_2.1-0
## [16] GetoptLong_1.1.0
## [17] withr_2.5.0
## [18] tidyselect_1.2.0
## [19] prettyunits_1.1.1
## [20] bit_4.0.5
## [21] curl_5.0.1
## [22] compiler_4.2.0
## [23] cli_3.6.1
## [24] xml2_1.3.5
## [25] DelayedArray_0.24.0
## [26] rtracklayer_1.58.0
## [27] sass_0.4.7
## [28] rappdirs_0.3.3
## [29] stringr_1.5.0
## [30] digest_0.6.33
## [31] Rsamtools_2.14.0
## [32] XVector_0.38.0
## [33] pkgconfig_2.0.3
## [34] htmltools_0.5.5
## [35] MatrixGenerics_1.10.0
## [36] highr_0.10
## [37] fastmap_1.1.1
## [38] htmlwidgets_1.6.2
## [39] rlang_1.1.1
## [40] GlobalOptions_0.1.2
## [41] RSQLite_2.3.1
## [42] shiny_1.7.4.1
## [43] prettydoc_0.4.1
## [44] shape_1.4.6
## [45] jquerylib_0.1.4
## [46] BiocIO_1.8.0
## [47] generics_0.1.3
## [48] jsonlite_1.8.7
## [49] BiocParallel_1.32.6
## [50] BioMartGOGeneSets_0.99.9
## [51] dplyr_1.1.2
## [52] RCurl_1.98-1.12
## [53] magrittr_2.0.3
## [54] GO.db_3.16.0
## [55] GenomeInfoDbData_1.2.9
## [56] Matrix_1.6-0
## [57] Rcpp_1.0.11
## [58] fansi_1.0.4
## [59] babelgene_22.9
## [60] lifecycle_1.0.3
## [61] org.Pt.eg.db_3.16.0
## [62] TxDb.Ptroglodytes.UCSC.panTro6.refGene_3.10.0
## [63] stringi_1.7.12
## [64] yaml_2.3.7
## [65] SummarizedExperiment_1.28.0
## [66] zlibbioc_1.44.0
## [67] org.Hs.eg.db_3.16.0
## [68] grid_4.2.0
## [69] blob_1.2.4
## [70] promises_1.2.0.1
## [71] parallel_4.2.0
## [72] crayon_1.5.2
## [73] lattice_0.21-8
## [74] Biostrings_2.66.0
## [75] circlize_0.4.16
## [76] hms_1.1.3
## [77] KEGGREST_1.38.0
## [78] pillar_1.9.0
## [79] rjson_0.2.21
## [80] codetools_0.2-19
## [81] biomaRt_2.54.1
## [82] BiocVersion_3.16.0
## [83] XML_3.99-0.14
## [84] glue_1.6.2
## [85] evaluate_0.21
## [86] BiocManager_1.30.21
## [87] httpuv_1.6.11
## [88] png_0.1-8
## [89] vctrs_0.6.3
## [90] foreach_1.5.2
## [91] purrr_1.0.1
## [92] cachem_1.0.8
## [93] xfun_0.39
## [94] mime_0.12
## [95] xtable_1.8-4
## [96] restfulr_0.0.15
## [97] later_1.3.1
## [98] tibble_3.2.1
## [99] TxDb.Hsapiens.UCSC.hg38.knownGene_3.16.0
## [100] iterators_1.0.14
## [101] GenomicAlignments_1.34.1
## [102] memoise_2.0.1
## [103] interactiveDisplayBase_1.36.0
## [104] ellipsis_0.3.2