vignettes/topic1_07_more_gene_sets.Rmd
topic1_07_more_gene_sets.Rmd
Besides those well-studied organisms, there are also resources for less well-studies organisms, mainly for KEGG and GO gene sets.
KEGG pathway supports a huge number of organisms, just try different organism code:
read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/ptr"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/pps"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/ggo"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/pon"), sep = "\t")
...
The BioMartGOGeneSets package has compiled GO gene sets for > 700 organisms using the biomart web service (with the biomaRt package.
To obtain the gene sets, use the function getBioMartGOGeneSets()
. Also you need to provide the “dataset”. The name of the dataset can be obtained by supportedOrganisms()
or from https://bioconductor.org/packages/devel/data/annotation/vignettes/BioMartGOGeneSets/inst/doc/supported_organisms.html. Let’s get GO gene sets for mouse.
library(BioMartGOGeneSets)
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl")
length(lt)
## [1] 15791
lt[1]
## $`GO:0000002`
## [1] "ENSMUSG00000019699" "ENSMUSG00000028982" "ENSMUSG00000022615"
## [4] "ENSMUSG00000039264" "ENSMUSG00000030557" "ENSMUSG00000030879"
## [7] "ENSMUSG00000032449" "ENSMUSG00000038084" "ENSMUSG00000022889"
## [10] "ENSMUSG00000041064" "ENSMUSG00000033845" "ENSMUSG00000107283"
## [13] "ENSMUSG00000027424" "ENSMUSG00000036875" "ENSMUSG00000038225"
## [16] "ENSMUSG00000022292" "ENSMUSG00000004069" "ENSMUSG00000039176"
## [19] "ENSMUSG00000025209" "ENSMUSG00000029911" "ENSMUSG00000020718"
## [22] "ENSMUSG00000030978" "ENSMUSG00000026496" "ENSMUSG00000028893"
## [25] "ENSMUSG00000002814" "ENSMUSG00000026365" "ENSMUSG00000015337"
## [28] "ENSMUSG00000034203" "ENSMUSG00000059552" "ENSMUSG00000055660"
## [31] "ENSMUSG00000028455" "ENSMUSG00000030314" "ENSMUSG00000020697"
## [34] "ENSMUSG00000036923" "ENSMUSG00000032633" "ENSMUSG00000029167"
The variable lt
is a list of vectors where each vector corresponds to a GO gene set with Ensembl IDs as gene identifiers.
You can try the following command and see what will be printed:
lt = getBioMartGOGeneSets("mouse")
In getBioMartGOGeneSets()
, argument as_table
can be set to TRUE
, then the function returns a data frame.
tb = getBioMartGOGeneSets("mmusculus_gene_ensembl", as_table = TRUE)
head(tb)
## go_geneset ensembl_gene
## 1 GO:0000002 ENSMUSG00000019699
## 2 GO:0000002 ENSMUSG00000028982
## 3 GO:0000002 ENSMUSG00000022615
## 4 GO:0000002 ENSMUSG00000039264
## 5 GO:0000002 ENSMUSG00000030557
## 6 GO:0000002 ENSMUSG00000030879
Argument ontology
controls which category of GO gene sets. Possible values should be "BP"
, "CC"
and "MF"
.
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "BP") # the default one
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "CC")
getBioMartGOGeneSets("mmusculus_gene_ensembl", ontology = "MF")
Last, argument gene_id_type
can be set to "entrez_gene"
or "gene_symbol"
, then genes in the gene sets are in Entrez IDs or gene symbols. Note this depends on specific organisms, that not every organism supports Entrez IDs or gene symbols.
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl", gene_id_type = "entrez_gene")
lt[1]
## $`GO:0000002`
## [1] "23797" "70556" "72962" "83408" "17258" "27397" "192287" "74143"
## [9] "27393" "208084" "27395" "17527" "74528" "327762" "408022" "382985"
## [17] "83945" "18975" "226153" "381760" "50776" "20133" "11545" "230784"
## [25] "21975" "12628" "13804" "72170" "22059" "76781" "66592" "74244"
## [33] "16882" "216021" "216805" "19017"
lt = getBioMartGOGeneSets("mmusculus_gene_ensembl", gene_id_type = "gene_symbol")
lt[1]
## $`GO:0000002`
## [1] "Akt3" "Slc25a33" "Tymp" "Gimap3" "Mef2a" "Mrpl17"
## [7] "Slc25a36" "Opa1" "Mrpl39" "Pif1" "Mrpl15" "Mpv17"
## [13] "Mgme1" "Dna2" "Primpol" "Rrm2b" "Dnaja3" "Polg"
## [19] "Twnk" "Ssbp1" "Polg2" "Rrm1" "Parp1" "Sesn2"
## [25] "Top3a" "Cfh" "Endog" "Chchd4" "Trp53" "Mettl4"
## [31] "Stoml2" "Atg7" "Lig3" "Stox1" "Flcn" "Ppargc1a"
Biocondutor core team maintaines org.*.db packages for 18 organisms
Package | Organism | Package | Organism |
---|---|---|---|
org.Hs.eg.db |
Human | org.Ss.eg.db |
Pig |
org.Mm.eg.db |
Mouse | org.Gg.eg.db |
Chicken |
org.Rn.eg.db |
Rat | org.Mmu.eg.db |
Rhesus monkey |
org.Dm.eg.db |
Fruit fly | org.Cf.eg.db |
Canine |
org.At.tair.db |
Arabidopsis | org.EcK12.eg.db |
E coli strain K12 |
org.Sc.sgd.db |
Yeast | org.Xl.eg.db |
African clawed frog |
org.Dr.eg.db |
Zebrafish | org.Ag.eg.db |
Malaria mosquito |
org.Ce.eg.db |
Nematode | org.Pt.eg.db |
Chimpanzee |
org.Bt.eg.db |
Bovine | org.EcSakai.eg.db |
E coli strain Sakai |
These org.*.db packages can be used in the same way as what we have seen with the org.Hs.eg.db package.
On AnnotationHub, there are OrgDb
objects for a huge number of other organisms (~2000) which can be used for getting GO genes and mappings between gene IDs.
library(AnnotationHub)
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: BiocFileCache
## Loading required package: dbplyr
ah = AnnotationHub()
To search for an organism, using its latin name is more suggested. Also add the “OrgDb” (data class) keyword:
## AnnotationHub with 29 records
## # snapshotDate(): 2023-04-25
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Viverra suricatta, Vermicularia truncata, Tursiops truncatus, Tr...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH111712"]]'
##
## title
## AH111712 | org.Catostomus_texanus.eg.sqlite
## AH111807 | org.Medicago_truncatula.eg.sqlite
## AH111810 | org.Felis_catus.eg.sqlite
## AH111812 | org.Felis_silvestris_catus.eg.sqlite
## AH111921 | org.Suricata_suricatta.eg.sqlite
## ... ...
## AH113351 | org.Indicator_indicator.eg.sqlite
## AH113366 | org.Truncatella_angustata.eg.sqlite
## AH113367 | org.Truncatella_truncata.eg.sqlite
## AH113440 | org.Catarrhactes_adeliae.eg.sqlite
## AH113528 | org.Polistes_fuscatus.eg.sqlite
## AnnotationHub with 1 record
## # snapshotDate(): 2023-04-25
## # names(): AH111810
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Felis catus
## # $rdataclass: OrgDb
## # $rdatadateadded: 2023-04-24
## # $title: org.Felis_catus.eg.sqlite
## # $description: NCBI gene ID based annotations about Felis catus
## # $taxonomyid: 9685
## # $genome: NCBI genomes
## # $sourcetype: NCBI/UniProt
## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/p...
## # $sourcesize: NA
## # $tags: c("NCBI", "Gene", "Annotation")
## # retrieve record with 'object[["AH111810"]]'
Besides using query()
to search for the AnnotationHub dataset, you can also use the BiocHubsShiny package to interactively searching for datasets:
Now we download the data.
org_db = ah[["AH111810"]] # using `[[` downloads the dataset
## loading from cache
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:AnnotationHub':
##
## cache
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
org_db
is an OrgDb
object but contains less information than the org.*.db packages:
org_db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
##
## Please see: help('select') for usage information
columns(org_db)
## [1] "ACCNUM" "ALIAS" "CHR" "ENSEMBL" "ENTREZID"
## [6] "EVIDENCE" "EVIDENCEALL" "GENENAME" "GID" "GO"
## [11] "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PMID" "REFSEQ"
## [16] "SYMBOL"
You can obtain the GO gene sets manually by select()
, taking the "ENTREZID"
and "GOALL"
columns:
all_genes = keys(org_db, keytype = "ENTREZID")
tb = select(org_db, keys = all_genes, keytype = "ENTREZID", columns = c("GOALL", "ONTOLOGYALL"))
## 'select()' returned 1:many mapping between keys and columns
head(tb)
## ENTREZID GOALL ONTOLOGYALL
## 1 100037403 <NA> <NA>
## 2 100037404 GO:0016339 BP
## 3 100037404 GO:0050900 BP
## 4 100037404 GO:0007155 BP
## 5 100037404 GO:0008150 BP
## 6 100037404 GO:0009987 BP
You may need to clean the mapping table:
With an OrgDb
object, you can also use the helper function get_GO_gene_sets_from_orgdb()
in GSEAtraining package:
library(GSEAtraining)
lt = get_GO_gene_sets_from_orgdb(org_db, "BP")
## 'select()' returned 1:many mapping between keys and columns
lt[1]
## $`GO:0000002`
## [1] "101081509" "101083159" "101083771" "101086314" "101086515" "101086671"
## [7] "101088958" "101093964" "101097343"
Try to obtain GO (BP) gene sets for dolphin (latin name: Tursiops truncatus) from BioMartGOGeneSets and AnnotationHub. Compare the difference of the GO gene sets from these two sources, e.g. compare the number of annotated genes from the two sources.
Get GO gene sets from biomart. First search the “dataset name” with supportedOrganisms()
.
gs1 = getBioMartGOGeneSets("ttruncatus_gene_ensembl", "BP", gene_id_type = "entrez_id")
Get GO gene sets from NCBI/AnnotationHub.
## AnnotationHub with 1 record
## # snapshotDate(): 2023-04-25
## # names(): AH112418
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Tursiops truncatus
## # $rdataclass: OrgDb
## # $rdatadateadded: 2023-04-24
## # $title: org.Tursiops_truncatus.eg.sqlite
## # $description: NCBI gene ID based annotations about Tursiops truncatus
## # $taxonomyid: 9739
## # $genome: NCBI genomes
## # $sourcetype: NCBI/UniProt
## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/p...
## # $sourcesize: NA
## # $tags: c("NCBI", "Gene", "Annotation")
## # retrieve record with 'object[["AH112418"]]'
orgdb = ah[["AH112418"]]
## loading from cache
gs2 = get_GO_gene_sets_from_orgdb(orgdb, "BP")
## 'select()' returned 1:many mapping between keys and columns
cn = intersect(names(gs1), names(gs2))
n1 = sapply(gs1, length)
n2 = sapply(gs2, length)
plot(n1[cn], n2[cn], xlab = "getBioMartGOGeneSets", ylab = "AnnotationHub")
Actually, the GO gene sets from getBioMartGOGeneSets()
/Ensembl is from an older genome build for dolphin. So it is more suggested to use the data from AnnotationHub/NCBI.