Topic 3-02: Local GREAT analysis

Basic usage

First let’s load the rGREAT package and generate a set of random regions:

library(rGREAT)

## Loading required package: GenomicRanges

## Loading required package: stats4

## Loading required package: BiocGenerics

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
##     table, tapply, union, unique, unsplit, which.max, which.min

## Loading required package: S4Vectors

## 
## Attaching package: 'S4Vectors'

## The following object is masked from 'package:utils':
## 
##     findMatches

## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname

## Loading required package: IRanges

## Loading required package: GenomeInfoDb

##

##

## ========================================
## rGREAT version 2.8.0
## Bioconductor page: http://bioconductor.org/packages/rGREAT/
## Github page: https://github.com/jokergoo/rGREAT
## 
## If you use it in published research, please cite:
## Gu, Z. rGREAT: an R/Bioconductor package for functional enrichment 
##   on genomic regions. Bioinformatics 2023.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(rGREAT))
## 
## Note: On Aug 19 2019 GREAT released version 4 where it supports `hg38`
## genome and removes some ontologies such pathways. `submitGreatJob()`
## still takes `hg19` as default. `hg38` can be specified by the `genome
## = 'hg38'` argument. To use the older versions such as 3.0.0, specify as
## `submitGreatJob(..., version = '3.0.0')`.
## 
## From rGREAT version 1.99.0, it also implements the GREAT algorithm and
## it allows to integrate GREAT analysis with the Bioconductor annotation
## ecosystem. By default it supports more than 500 organisms. Check the
## new function `great()` and the new vignette.
## ========================================

set.seed(123)
gr = randomRegions(nr = 1000, genome = "hg19")

GREAT analysis needs two types of input data: the gene sets and TSS source. Thus the second argument in great() corresponds to the gene sets, and the third argumetn corresponds to the source of TSS.

rGREAT has integrated many gene sets and genomes.

res = great(gr, "MSigDB:H", "hg19")

Following functions are very similar as in online GREAT.

tb = getEnrichmentTable(res)
plotVolcano(res)

plotRegionGeneAssociations(res)

getRegionGeneAssociations(res)

## GRanges object with 869 ranges and 2 metadata columns:
##         seqnames            ranges strand | annotated_genes    dist_to_TSS
##            <Rle>         <IRanges>  <Rle> | <CharacterList>  <IntegerList>
##     [1]     chr1   9204434-9208784      * |     GPR157,H6PD  -15205,-86079
##     [2]     chr1   9853594-9859363      * |   PIK3CD,CLSTN1   141804,25187
##     [3]     chr1 10862809-10871681      * |  CASZ1,C1orf127   -6076,170413
##     [4]     chr1 12716970-12723206      * | AADACL4,AADACL3   12404,-52912
##     [5]     chr1 13814692-13823250      * | PRAMEF17,LRRC38    98604,16992
##     ...      ...               ...    ... .             ...            ...
##   [865]     chrY 16384799-16389852      * |   TMSB4Y,NLGN4Y 569352,-244636
##   [866]     chrY 16842593-16846893      * |          NLGN4Y         208105
##   [867]     chrY 16998750-17003225      * |          NLGN4Y         364262
##   [868]     chrY 24401612-24406252      * |   RBMY1E,RBMY1F  -72523,-48754
##   [869]     chrY 26206997-26215445      * |            DAZ2         841416
##   -------
##   seqinfo: 26 sequences from an unspecified genome; no seqlengths

plotRegionGeneAssociations(res, term_id = "HALLMARK_APOPTOSIS")

# shinyReport(res)

The following gene sets are directly supported:

"GO:BP": Biological Process.
"GO:CC": Cellular Component.
"GO:MP": Molecular Function.

Note MSigDB genes are only for human.

"msigdb:H" Hallmark gene sets.
"msigdb:C1" Positional gene sets.
"msigdb:C2" Curated gene sets.
"msigdb:C2:CGP" C2 subcategory: chemical and genetic perturbations gene sets.
"msigdb:C2:CP" C2 subcategory: canonical pathways gene sets.
"msigdb:C2:CP:KEGG" C2 subcategory: KEGG subset of CP.
"msigdb:C2:CP:PID" C2 subcategory: PID subset of CP.
"msigdb:C2:CP:REACTOME" C2 subcategory: REACTOME subset of CP.
"msigdb:C2:CP:WIKIPATHWAYS" C2 subcategory: WIKIPATHWAYS subset of CP.
"msigdb:C3" Regulatory target gene sets.
"msigdb:C3:MIR:MIRDB" miRDB of microRNA targets gene sets.
"msigdb:C3:MIR:MIR_LEGACY" MIR_Legacy of MIRDB.
"msigdb:C3:TFT:GTRD" GTRD transcription factor targets gene sets.
"msigdb:C3:TFT:TFT_LEGACY" TFT_Legacy.
"msigdb:C4" Computational gene sets.
"msigdb:C4:CGN" C4 subcategory: cancer gene neighborhoods gene sets.
"msigdb:C4:CM" C4 subcategory: cancer modules gene sets.
"msigdb:C5" Ontology gene sets.
"msigdb:C5:GO:BP" C5 subcategory: BP subset.
"msigdb:C5:GO:CC" C5 subcategory: CC subset.
"msigdb:C5:GO:MF" C5 subcategory: MF subset.
"msigdb:C5:HPO" C5 subcategory: human phenotype ontology gene sets.
"msigdb:C6" Oncogenic signature gene sets.
"msigdb:C7" Immunologic signature gene sets.
"msigdb:C7:IMMUNESIGDB" ImmuneSigDB subset of C7.
"msigdb:C7:VAX" C7 subcategory: vaccine response gene sets.
"msigdb:C8" Cell type signature gene sets.

The "GO:" and “msigdb:” prefix can be removed when specifying in great().

Setting genome background

Background can be set via backgroud and excluce arguments.

gap = getGapFromUCSC("hg19", paste0("chr", c(1:22, "X", "Y")))
great(gr, "MSigDB:H", "hg19", exclude = gap)

great(gr, "GO:BP", "hg19", background = paste0("chr", 1:22))
great(gr, "GO:BP", "hg19", exclude = c("chrX", "chrY"))

Restricting the anallysis in a certain type of genome background helps to reduce the false positives. Examples are:

excluding unmappable regions
regions with certain level of CG density
regions with certain level of gene densities
exons (coding regions)

Extend from the complete gene

This feature is only available in the devel version of rGREAT.

great(gr, "BP", "hg19", extend_from = "gene")

Using other gene sets

More gene sets can be set as a list of gene vectors. For most organisms, EntreZ IDs should be used as the gene ID type in gene sets.

From a .gmt file:

gs = read_gmt(url("http://dsigdb.tanlab.org/Downloads/D2_LINCS.gmt"), 
    from = "SYMBOL", to = "ENTREZ", orgdb = "org.Hs.eg.db")
gs[1:2]

## $GSK429286A
##  [1] "5566"  "9475"  "81788" "6093"  "5562"  "26524" "5592"  "5979"  "640"  
## [10] "5567"  "3717"  "5613"  "23012" "695"   "3718" 
## 
## $`BS-181`
##  [1] "122011" "1452"   "1022"   "7272"   "1454"   "1453"   "65975"  "147746"
##  [9] "1859"   "1195"   "1196"   "9149"   "57396"  "5261"

great(gr, gs, "hg19")

## 916 regions are associated to 1400 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

KEGG pathways:

df = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
df[, 1] = gsub("hsa:", "", df[, 1])
df[, 2] = gsub("path:",  "", df[, 2])
gs_kegg = split(df[, 1], df[, 2])
gs_kegg[1:2]

## $hsa00010
##  [1] "10327"  "124"    "125"    "126"    "127"    "128"    "130"    "130589"
##  [9] "131"    "160287" "1737"   "1738"   "2023"   "2026"   "2027"   "217"   
## [17] "218"    "219"    "2203"   "221"    "222"    "223"    "224"    "226"   
## [25] "229"    "230"    "2538"   "2597"   "26330"  "2645"   "2821"   "3098"  
## [33] "3099"   "3101"   "387712" "3939"   "3945"   "3948"   "441531" "501"   
## [41] "5105"   "5106"   "5160"   "5161"   "5162"   "5211"   "5213"   "5214"  
## [49] "5223"   "5224"   "5230"   "5232"   "5236"   "5313"   "5315"   "55276" 
## [57] "55902"  "57818"  "669"    "7167"   "80201"  "83440"  "84532"  "8789"  
## [65] "92483"  "92579"  "9562"  
## 
## $hsa00020
##  [1] "1431"  "1737"  "1738"  "1743"  "2271"  "3417"  "3418"  "3419"  "3420" 
## [10] "3421"  "4190"  "4191"  "47"    "48"    "4967"  "50"    "5091"  "5105" 
## [19] "5106"  "5160"  "5161"  "5162"  "55753" "6389"  "6390"  "6391"  "6392" 
## [28] "8801"  "8802"  "8803"

great(gr, gs_kegg, "hg19")

## 916 regions are associated to 1400 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

Reactome pathways:

library(reactome.db)

## Loading required package: AnnotationDbi

## Loading required package: Biobase

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

gs_reactome = as.list(reactomePATHID2EXTID)
great(gr, gs_reactome, "hg19")

## 916 regions are associated to 1400 genes' extended TSSs.
##   TSS source: TxDb.Hsapiens.UCSC.hg19.knownGene
##   Genome: hg19
##   OrgDb: org.Hs.eg.db
##   Gene sets: self-provided
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

Other organisms who have txdb packages

On Bioconductor, there are a family of annotation packages with name txdb.*.db. These packages contain sources of TSS definitions.

All supported txdb.*.db packages:

tb = rGREAT:::BIOC_ANNO_PKGS
knitr::kable(tb[!duplicated(tb$genome_version_in_txdb), ])

	species_name	species_latin	taxon_id	txdb	gene_id_in_txdb	genome_version_in_txdb	orgdb	primary_gene_id_in_orgdb	chromosome_prefix
1	human	Homo sapiens	9606	TxDb.Hsapiens.UCSC.hg18.knownGene	Entrez Gene ID	hg18	org.Hs.eg.db	Entrez Gene ID	chr
2	human	Homo sapiens	9606	TxDb.Hsapiens.UCSC.hg19.knownGene	Entrez Gene ID	hg19	org.Hs.eg.db	Entrez Gene ID	chr
3	human	Homo sapiens	9606	TxDb.Hsapiens.UCSC.hg38.knownGene	Entrez Gene ID	hg38	org.Hs.eg.db	Entrez Gene ID	chr
5	mouse	Mus musculus	10090	TxDb.Mmusculus.UCSC.mm10.knownGene	Entrez Gene ID	mm10	org.Mm.eg.db	Entrez Gene ID	chr
7	mouse	Mus musculus	10090	TxDb.Mmusculus.UCSC.mm39.refGene	Entrez Gene ID	mm39	org.Mm.eg.db	Entrez Gene ID	chr
8	mouse	Mus musculus	10090	TxDb.Mmusculus.UCSC.mm9.knownGene	Entrez Gene ID	mm9	org.Mm.eg.db	Entrez Gene ID	chr
9	rat	Rattus norvegicus	10116	TxDb.Rnorvegicus.UCSC.rn4.ensGene	Ensembl gene ID	rn4	org.Rn.eg.db	Entrez Gene ID	chr
10	rat	Rattus norvegicus	10116	TxDb.Rnorvegicus.UCSC.rn5.refGene	Entrez Gene ID	rn5	org.Rn.eg.db	Entrez Gene ID	chr
11	rat	Rattus norvegicus	10116	TxDb.Rnorvegicus.UCSC.rn6.refGene	Entrez Gene ID	rn6	org.Rn.eg.db	Entrez Gene ID	chr
12	rat	Rattus norvegicus	10116	TxDb.Rnorvegicus.UCSC.rn7.refGene	Entrez Gene ID	rn7	org.Rn.eg.db	Entrez Gene ID	chr
13	chicken	Gallus gallus	9031	TxDb.Ggallus.UCSC.galGal4.refGene	Entrez Gene ID	galGal4	org.Gg.eg.db	Entrez Gene ID	chr
14	chicken	Gallus gallus	9031	TxDb.Ggallus.UCSC.galGal5.refGene	Entrez Gene ID	galGal5	org.Gg.eg.db	Entrez Gene ID	chr
15	chicken	Gallus gallus	9031	TxDb.Ggallus.UCSC.galGal6.refGene	Entrez Gene ID	galGal6	org.Gg.eg.db	Entrez Gene ID	chr
16	rhesus	Macaca mulatta	9544	TxDb.Mmulatta.UCSC.rheMac10.refGene	Entrez Gene ID	rheMac10	org.Mmu.eg.db	Entrez Gene ID	chr
17	rhesus	Macaca mulatta	9544	TxDb.Mmulatta.UCSC.rheMac3.refGene	Entrez Gene ID	rheMac3	org.Mmu.eg.db	Entrez Gene ID	chr
18	rhesus	Macaca mulatta	9544	TxDb.Mmulatta.UCSC.rheMac8.refGene	Entrez Gene ID	rheMac8	org.Mmu.eg.db	Entrez Gene ID	chr
19	worm	Caenorhabditis elegans	6239	TxDb.Celegans.UCSC.ce11.refGene	Entrez Gene ID	ce11	org.Ce.eg.db	Entrez Gene ID	chr
21	dog	Canis familiari	9615	TxDb.Cfamiliaris.UCSC.canFam3.refGene	Entrez Gene ID	canFam3	org.Cf.eg.db	Entrez Gene ID	chr
22	dog	Canis familiari	9615	TxDb.Cfamiliaris.UCSC.canFam4.refGene	Entrez Gene ID	canFam4	org.Cf.eg.db	Entrez Gene ID	chr
23	dog	Canis familiari	9615	TxDb.Cfamiliaris.UCSC.canFam5.refGene	Entrez Gene ID	canFam5	org.Cf.eg.db	Entrez Gene ID	chr
24	pig	Sus scrofa	9823	TxDb.Sscrofa.UCSC.susScr11.refGene	Entrez Gene ID	susScr11	org.Ss.eg.db	Entrez Gene ID	chr
25	pig	Sus scrofa	9823	TxDb.Sscrofa.UCSC.susScr3.refGene	Entrez Gene ID	susScr3	org.Ss.eg.db	Entrez Gene ID	chr
26	yeast	Saccharomyces cerevisiae	4932	TxDb.Scerevisiae.UCSC.sacCer2.sgdGene	SGD Gene ID	sacCer2	org.Sc.sgd.db	SGD Gene ID	chr
27	yeast	Saccharomyces cerevisiae	4932	TxDb.Scerevisiae.UCSC.sacCer3.sgdGene	SGD Gene ID	sacCer3	org.Sc.sgd.db	SGD Gene ID	chr
28	chimp	Pan troglodytes	9598	TxDb.Ptroglodytes.UCSC.panTro4.refGene	Entrez Gene ID	panTro4	org.Pt.eg.db	Entrez Gene ID	chr
29	chimp	Pan troglodytes	9598	TxDb.Ptroglodytes.UCSC.panTro5.refGene	Entrez Gene ID	panTro5	org.Pt.eg.db	Entrez Gene ID	chr
30	chimp	Pan troglodytes	9598	TxDb.Ptroglodytes.UCSC.panTro6.refGene	Entrez Gene ID	panTro6	org.Pt.eg.db	Entrez Gene ID	chr
31	fruitfly	Drosophila melanogaster	7227	TxDb.Dmelanogaster.UCSC.dm3.ensGene	Ensembl gene ID	dm3	org.Dm.eg.db	Entrez Gene ID	chr
32	fruitfly	Drosophila melanogaster	7227	TxDb.Dmelanogaster.UCSC.dm6.ensGene	Ensembl gene ID	dm6	org.Dm.eg.db	Entrez Gene ID	chr
33	zebrafish	Danio rerio	7955	TxDb.Drerio.UCSC.danRer10.refGene	Entrez Gene ID	danRer10	org.Dr.eg.db	Entrez Gene ID	chr
34	zebrafish	Danio rerio	7955	TxDb.Drerio.UCSC.danRer11.refGene	Entrez Gene ID	danRer11	org.Dr.eg.db	Entrez Gene ID	chr
35	cattle	Bos taurus	9913	TxDb.Btaurus.UCSC.bosTau8.refGene	Entrez Gene ID	bosTau8	org.Bt.eg.db	Entrez Gene ID	chr
36	cattle	Bos taurus	9913	TxDb.Btaurus.UCSC.bosTau9.refGene	Entrez Gene ID	bosTau9	org.Bt.eg.db	Entrez Gene ID	chr
37	thale cress	Arabidopsis thaliana	3702	TxDb.Athaliana.BioMart.plantsmart51	TAIR ID	araTha	org.At.tair.db	TAIR ID

For these packages, users only need to specify the “genome version”:

great(gr, "BP", "mm10")
great(gr, "BP", "rn7")

Use MSigDB gene sets

The msigdbr package provides gene sets by mapping to several other organisms.

library(msigdbr)
msigdbr_species()

## # A tibble: 20 × 2
##    species_name                    species_common_name                          
##    <chr>                           <chr>                                        
##  1 Anolis carolinensis             Carolina anole, green anole                  
##  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
##  3 Caenorhabditis elegans          NA                                           
##  4 Canis lupus familiaris          dog, dogs                                    
##  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
##  6 Drosophila melanogaster         fruit fly                                    
##  7 Equus caballus                  domestic horse, equine, horse                
##  8 Felis catus                     cat, cats, domestic cat                      
##  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
## 10 Homo sapiens                    human                                        
## 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica           gray short-tailed opossum                    
## 13 Mus musculus                    house mouse, mouse                           
## 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes                 chimpanzee                                   
## 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
## 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
## 18 Schizosaccharomyces pombe 972h- NA                                           
## 19 Sus scrofa                      pig, pigs, swine, wild boar                  
## 20 Xenopus tropicalis              tropical clawed frog, western clawed frog

h_gene_sets = msigdbr(species = "chimpanzee", category = "H")
head(h_gene_sets)

## # A tibble: 6 × 18
##   gs_cat gs_subcat gs_name               gene_symbol entrez_gene ensembl_gene   
##   <chr>  <chr>     <chr>                 <chr>             <int> <chr>          
## 1 H      ""        HALLMARK_ADIPOGENESIS ABCA1            464630 ENSPTRG0000002…
## 2 H      ""        HALLMARK_ADIPOGENESIS ABCB8            463892 ENSPTRG0000001…
## 3 H      ""        HALLMARK_ADIPOGENESIS ACAA2            455414 ENSPTRG0000001…
## 4 H      ""        HALLMARK_ADIPOGENESIS ACADL            459914 ENSPTRG0000001…
## 5 H      ""        HALLMARK_ADIPOGENESIS ACADM            469356 ENSPTRG0000000…
## 6 H      ""        HALLMARK_ADIPOGENESIS ACADS            742921 ENSPTRG0000000…
## # ℹ 12 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>, taxon_id <int>,
## #   ortholog_sources <chr>, num_ortholog_sources <dbl>

Convert from the original format to a list of gene vectors.

h_gene_sets = split(h_gene_sets$entrez_gene, h_gene_sets$gs_name)
h_gene_sets = lapply(h_gene_sets, as.character)  # just to make sure gene IDs are all in character.
h_gene_sets = lapply(h_gene_sets, function(x) unique(x[!is.na(x)])) # remove NA and duplicated genes
h_gene_sets[1:2]

## $HALLMARK_ADIPOGENESIS
##   [1] "464630"    "463892"    "455414"    "459914"    "469356"    "742921"   
##   [7] "454672"    "104003784" "454895"    "451866"    "737339"    "471032"   
##  [13] "451742"    "737305"    "100615914" "456723"    "107967644" "454362"   
##  [19] "464334"    "743667"    "741867"    "449586"    "100614256" "741708"   
##  [25] "459164"    "746692"    "473976"    "452433"    "468889"    "745443"   
##  [31] "460926"    "455644"    "451116"    "454684"    "744890"    "461229"   
##  [37] "740513"    "104005232" "463949"    "469319"    "748673"    "450673"   
##  [43] "468605"    "471455"    "456837"    "464611"    "452659"    "472079"   
##  [49] "452307"    "454118"    "100616508" "465727"    "742828"    "737945"   
##  [55] "107976794" "746229"    "472893"    "456557"    "457056"    "747265"   
##  [61] "736777"    "464460"    "451393"    "745691"    "454512"    "466780"   
##  [67] "463861"    "744984"    "452566"    "457117"    "747936"    "459360"   
##  [73] "461436"    "464353"    "464074"    "466651"    "451984"    "456243"   
##  [79] "464255"    "467738"    "466732"    "461244"    "456929"    "460520"   
##  [85] "450562"    "450738"    "464140"    "459670"    "452976"    "471703"   
##  [91] "741876"    "471135"    "461424"    "459828"    "452295"    "460113"   
##  [97] "453565"    "741179"    "747276"    "470423"    "451967"    "450290"   
## [103] "473975"    "460157"    "462946"    "449638"    "738797"    "456076"   
## [109] "451807"    "464031"    "739986"    "459173"    "460872"    "463484"   
## [115] "462853"    "739167"    "457477"    "742027"    "746245"    "472764"   
## [121] "747387"    "744096"    "101057233" "744811"    "463686"    "744435"   
## [127] "468748"    "451175"    "460227"    "454744"    "739996"    "450735"   
## [133] "454478"    "457929"    "738397"    "458602"    "456908"    "451591"   
## [139] "450310"    "107970333" "465012"    "463481"    "460178"    "470365"   
## [145] "742092"    "741184"    "459094"    "459374"    "456940"    "745779"   
## [151] "454531"    "737918"    "107973114" "742100"    "470420"    "468499"   
## [157] "467657"    "100608935" "462416"    "451281"    "470281"    "452359"   
## [163] "456862"    "456526"    "747462"    "474051"    "456155"    "458647"   
## [169] "744390"    "455841"    "459096"    "459031"    "450574"    "449637"   
## [175] "450628"    "470477"    "471247"    "453405"    "739128"    "454681"   
## [181] "464707"    "470417"    "450933"    "459685"    "460443"    "468406"   
## [187] "458803"    "467151"    "464550"    "745004"    "451416"    "735808"   
## [193] "743144"    "460348"    "107974864" "471631"    "741897"    "463489"   
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1] "454210"    "461523"    "450363"    "100609296" "459646"    "740898"   
##   [7] "466415"    "450170"    "465345"    "456984"    "744209"    "449497"   
##  [13] "100608992" "459361"    "741390"    "468208"    "748142"    "473220"   
##  [19] "748205"    "736543"    "454593"    "747004"    "454579"    "747123"   
##  [25] "462689"    "460323"    "740071"    "450128"    "469524"    "449512"   
##  [31] "748272"    "470617"    "451584"    "742330"    "451585"    "450124"   
##  [37] "100615583" "473802"    "460569"    "745293"    "462191"    "740560"   
##  [43] "470892"    "470900"    "735755"    "470426"    "460577"    "465021"   
##  [49] "465607"    "736196"    "457127"    "453745"    "457277"    "738275"   
##  [55] "471200"    "459634"    "457770"    "469142"    "463415"    "466216"   
##  [61] "458797"    "453714"    "469204"    "750603"    "100615835" "740028"   
##  [67] "451695"    "451158"    "471510"    "738331"    "469584"    "739516"   
##  [73] "465940"    "461906"    "468521"    "457003"    "472959"    "457020"   
##  [79] "467610"    "461873"    "452825"    "460623"    "463280"    "746195"   
##  [85] "750725"    "100608816" "449592"    "471979"    "471977"    "471974"   
##  [91] "462591"    "462540"    "494187"    "450196"    "474132"    "473965"   
##  [97] "449517"    "747276"    "470077"    "743102"    "472749"    "469657"   
## [103] "736204"    "460816"    "471723"    "455851"    "449564"    "737808"   
## [109] "449644"    "739011"    "744277"    "450200"    "461472"    "456370"   
## [115] "450884"    "470203"    "101059843" "449565"    "454005"    "463288"   
## [121] "464245"    "745517"    "463371"    "470524"    "462386"    "450927"   
## [127] "454294"    "454045"    "458607"    "735556"    "464979"    "450156"   
## [133] "738375"    "456715"    "471734"    "736309"    "744486"    "459682"   
## [139] "745667"    "472771"    "462888"    "748652"    "449582"    "458294"   
## [145] "460699"    "459239"    "741196"    "460720"    "100322885" "469743"   
## [151] "455026"    "740704"    "740477"    "450512"    "453993"    "456276"   
## [157] "743176"    "748032"    "457607"    "462249"    "464277"    "737451"   
## [163] "746600"    "737526"    "456065"    "461536"    "107966305" "746721"   
## [169] "737070"    "459209"    "451169"    "450503"    "461971"    "461023"   
## [175] "459834"    "100610925" "471978"    "746399"    "746814"    "456060"   
## [181] "457742"    "451611"    "107971092" "461167"    "471325"    "471374"   
## [187] "471167"    "494186"    "748737"    "464876"    "741922"    "745141"   
## [193] "452125"    "453161"    "743187"    "459427"

Remember to pick the correct genome version.

great(gr, h_gene_sets, "panTro6")

Other organisms who have txdb object on AnnotationHub

For some other organisms, although they don’t have a txdb package on Bioconductor, but they have a txdb object on AnnotationHub:

library(AnnotationHub)

## Loading required package: BiocFileCache

## Loading required package: dbplyr

## 
## Attaching package: 'AnnotationHub'

## The following object is masked from 'package:Biobase':
## 
##     cache

ah = AnnotationHub()
query(ah, "TxDb")

## AnnotationHub with 511 records
## # snapshotDate(): 2024-10-28
## # $dataprovider: UCSC, FungiDB, TriTrypDB, PlasmoDB, MicrosporidiaDB, ToxoDB...
## # $species: Homo sapiens, Rattus norvegicus, Macaca mulatta, Gallus gallus, ...
## # $rdataclass: TxDb, SQLiteFile, FaFile, ChainFile
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH52245"]]' 
## 
##              title                                     
##   AH52245  | TxDb.Athaliana.BioMart.plantsmart22.sqlite
##   AH52246  | TxDb.Athaliana.BioMart.plantsmart25.sqlite
##   AH52247  | TxDb.Athaliana.BioMart.plantsmart28.sqlite
##   AH52248  | TxDb.Btaurus.UCSC.bosTau8.refGene.sqlite  
##   AH52249  | TxDb.Celegans.UCSC.ce11.refGene.sqlite    
##   ...        ...                                       
##   AH114096 | TxDb.Mmusculus.UCSC.mm39.refGene.sqlite   
##   AH116719 | TxDb.Hsapiens.UCSC.hg38.refGene.sqlite    
##   AH116720 | TxDb.Mmusculus.UCSC.mm39.refGene.sqlite   
##   AH117076 | TxDb.Hsapiens.UCSC.hg38.knownGene.sqlite  
##   AH117077 | TxDb.Mmusculus.UCSC.mm39.knownGene.sqlite

But it seems many of those organisms are microbes. I will not introduce it here.

Other organisms who do not have txdb objects

Then we need to obtain the TSS definition manually. Obtaining gene sets for an organism is relatively easy, but it is sometimes difficult to find a proper genome version because for these “rare” genomes, their genome data evolves very fast.

There are mainly two sources to obtain TSS definitions: 1. from NCBI, 2. from Ensembl.

Get TSS from NCBI

Currently some service from NCBI is not available, so we skip this and the next section.

We need to first know the “accession number” of an organism’s genome. Taking dolphin as an example, its web page on NCBI Genome is https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_011762595.1/. In rGREAT, there is a getGenomeDataFromNCBI() function which can automatically retrieve gene definitions from NCBI.

genes = getGenomeDataFromNCBI("GCF_011762595.1", return_granges = TRUE)
genes

As we have already introduced, retrieve GO gene sets for an organism is easy:

library(AnnotationHub)
ah = AnnotationHub()
orgdb = ah[["AH112418"]]

library(GSEAtraining)
gs = get_GO_gene_sets_from_orgdb(orgdb, "BP")
gs[1:2]

Note genes is set to the argument tss_source:

great(gr, gs, tss_source = genes)

Remember to double check the format of chromosome names in gr and in genes.

KEGG

KEGG also links to NCBI genomes, but the genome version for an organism is not always synchronized to NCBI. The function getKEGGGenome() returns the accession number of the genome recorded on KEGG, and this accession number can be directly used in getGenomeDataFromNCBI().

getKEGGPathways() is a helper function which downloads KEGG pathways.

In the following example, "mgp" is the organism code for Meleagris gallopavo (turkey).

genes = getGenomeDataFromNCBI(getKEGGGenome("mgp"), return_granges = TRUE)
head(genes)
gene_sets = getKEGGPathways("mgp")
gene_sets[1:2]

great(gr, gene_sets, genes)

Use BioMart GO gene sets

The BioMartGOGeneSets package not only contains GO gene sets for > 700 organisms, but also the gene definitions. great() has already integrate BioMartGOGeneSets and you only need to specify the biomart_dataset argument.

Note, the genes from biomart may not come from the genome with the same version as on NCBI, normally older than NCBI. You need to double check the genome version on biomart and in your gr.

great(gr, "GO:BP", biomart_dataset = "amelanoleuca_gene_ensembl")

Practice

Practice 1

Run local GREAT analysis with a list of TFBS from UCSC table browser:

clade: Mammal
genome: Human
assembly: GRCh37/hg19
group: Regulation
track: ENCODE 3 TFBS
table: 22Rv1 ZFX (encTfChipRkENCFF445OFX)

In the “output format” field, select “BED - browser extensible data”, then click the button “get output”. In the next page, click the button “get BED”.

Use 1. the GO BP gene sets and 2. KEGG pathways.

Zuguang Gu z.gu@dkfz.de

2025-03-02