Topic 1-06: GO/KEGG gene sets for other organisms
Zuguang Gu z.gu@dkfz.de
2025-05-31
Source:vignettes/topic1_06_more_gene_sets.Rmd
Besides those well-studied organisms, there are also resources for less well-studied organisms, mainly KEGG and GO gene sets.
KEGG pathways
KEGG provides pathways for a huge number of organisms. You only need to find the corresponding organism code:
read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/ptr"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/pps"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/ggo"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/pon"), sep = "\t")
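The listings above contain only pathway IDs and names. Gene membership is available from the `link` operation of the same REST service. The following is a minimal sketch, assuming the `link/pathway/<organism>` endpoint returns two tab-separated columns (gene ID, then pathway ID) carrying `hsa:` and `path:` style prefixes. The helper name `kegg_links_to_genesets()` is hypothetical and is not part of GSEAtopics.

```r
# Hypothetical helper: turn a two-column gene/pathway table into a list of gene sets
kegg_links_to_genesets = function(tb) {
    # strip the database prefixes, e.g. "hsa:10327" -> "10327",
    # "path:hsa00010" -> "hsa00010"
    gene = sub("^[^:]+:", "", tb[[1]])
    pathway = sub("^path:", "", tb[[2]])
    # one character vector of unique gene IDs per pathway
    lapply(split(gene, pathway), unique)
}

# Small in-memory table in the assumed format (values are illustrative)
example_tb = data.frame(
    V1 = c("hsa:10327", "hsa:124", "hsa:10327"),
    V2 = c("path:hsa00010", "path:hsa00010", "path:hsa00040"),
    stringsAsFactors = FALSE
)
gs = kegg_links_to_genesets(example_tb)

# With network access, the real table could be read in the same way as above:
# tb = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
# gs = kegg_links_to_genesets(tb)
```

The result is a named list with one element per pathway, which is the usual shape for gene sets in enrichment analysis.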
The same pattern applies to any other organism code. Alternatively, use the helper function from GSEAtopics:
load_kegg_genesets(organism)
OrgDb packages
The Bioconductor core team maintains org.*.db packages for 18 organisms.
| Package | Organism | Package | Organism |
|---|---|---|---|
| org.Hs.eg.db | Human | org.Ss.eg.db | Pig |
| org.Mm.eg.db | Mouse | org.Gg.eg.db | Chicken |
| org.Rn.eg.db | Rat | org.Mmu.eg.db | Rhesus monkey |
| org.Dm.eg.db | Fruit fly | org.Cf.eg.db | Canine |
| org.At.tair.db | Arabidopsis | org.EcK12.eg.db | E coli strain K12 |
| org.Sc.sgd.db | Yeast | org.Xl.eg.db | African clawed frog |
| org.Dr.eg.db | Zebrafish | org.Ag.eg.db | Malaria mosquito |
| org.Ce.eg.db | Nematode | org.Pt.eg.db | Chimpanzee |
| org.Bt.eg.db | Bovine | org.EcSakai.eg.db | E coli strain Sakai |
These org.*.db packages can be used in the same way as the org.Hs.eg.db package. In the code below, org.* stands for the prefix of the specific package, e.g. org.Mm.eg for mouse:
select(org.*.db, keys = keys(org.*.db), columns = c("GOALL", "ONTOLOGYALL"))
select(org.*.db, keys = "BP", keytype = "ONTOLOGYALL", columns = c("ENTREZID", "GOALL"))
as.list(org.*.egGO2ALLEGS)
Note that if the primary ID type is not Entrez ID, the name of the object might be different.
AnnotationHub
On AnnotationHub, there are OrgDb
objects for a huge number of other organisms (~2000) which can be used
for getting GO genes and mappings between gene IDs.
library(AnnotationHub)
ah = AnnotationHub()
To search for an organism, it is best to use its Latin name. Also add the “OrgDb” (data class) keyword:
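The query calls that produced the two outputs below are not shown in this text. The following is a minimal sketch using the `query()` function from AnnotationHub. The search terms `"cat"` and `"Felis catus"` are assumptions, chosen only because they are consistent with the record titles in those outputs.

```r
library(AnnotationHub)
ah = AnnotationHub()

# a loose keyword is matched as a pattern against the record metadata,
# so it can return many unrelated organisms
query(ah, c("cat", "OrgDb"))

# the full Latin name narrows the search down
hits = query(ah, c("Felis catus", "OrgDb"))
hits
```

Both calls need network access, and the number of records returned depends on the AnnotationHub snapshot in use.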
## AnnotationHub with 27 records
## # snapshotDate(): 2024-10-28
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Viverra suricatta, Vermicularia truncata, Tursiops truncatus, Tr...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## # rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH117430"]]'
##
## title
## AH117430 | org.Catostomus_texanus.eg.sqlite
## AH117534 | org.Medicago_truncatula.eg.sqlite
## AH117537 | org.Felis_catus.eg.sqlite
## AH117539 | org.Felis_silvestris_catus.eg.sqlite
## AH117661 | org.Suricata_suricatta.eg.sqlite
## ... ...
## AH118997 | org.Vermicularia_truncata.eg.sqlite
## AH119133 | org.Cuculus_indicator.eg.sqlite
## AH119134 | org.Indicator_indicator.eg.sqlite
## AH119157 | org.Truncatella_angustata.eg.sqlite
## AH119158 | org.Truncatella_truncata.eg.sqlite
## AnnotationHub with 1 record
## # snapshotDate(): 2024-10-28
## # names(): AH117537
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Felis catus
## # $rdataclass: OrgDb
## # $rdatadateadded: 2024-10-04
## # $title: org.Felis_catus.eg.sqlite
## # $description: NCBI gene ID based annotations about Felis catus
## # $taxonomyid: 9685
## # $genome: NCBI genomes
## # $sourcetype: NCBI/UniProt
## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/p...
## # $sourcesize: NA
## # $tags: c("NCBI", "Gene", "Annotation")
## # retrieve record with 'object[["AH117537"]]'
Besides using query() to search for an AnnotationHub
dataset, you can also use the BiocHubsShiny package to
search for datasets interactively, or the ah_shiny()
function from GSEAtopics.
Now we download the data.
# It is annoying that the ID changes between different package versions
org_db = ah[["AH117537"]] # using `[[` downloads the dataset
org_db is an OrgDb object but contains less
information than the org.*.db packages:
org_db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
columns(org_db)
## [1] "ACCNUM" "ALIAS" "CHR" "ENSEMBL" "ENTREZID"
## [6] "EVIDENCE" "EVIDENCEALL" "GENENAME" "GID" "GO"
## [11] "GOALL" "ONTOLOGY" "ONTOLOGYALL" "PMID" "REFSEQ"
## [16] "SYMBOL"
You can obtain the GO gene sets manually by select(),
taking the "ENTREZID" and "GOALL" columns:
all_genes = keys(org_db, keytype = "ENTREZID")
tb = select(org_db, keys = all_genes, keytype = "ENTREZID",
    columns = c("GOALL", "ONTOLOGYALL"))
head(tb)
## ENTREZID GOALL ONTOLOGYALL
## 1 100037403 GO:0007165 BP
## 2 100037403 GO:0007268 BP
## 3 100037403 GO:0034220 BP
## 4 100037403 GO:0042391 BP
## 5 100037403 GO:0050877 BP
## 6 100037403 GO:0007154 BP
You may need to clean the mapping table:
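The cleaning step itself is not shown here. The following is a minimal sketch of what it might involve, assuming the issues are rows with missing GO IDs (genes without annotation) and repeated gene/term pairs. The helper name `clean_go_table()` is hypothetical, and the column names follow the `tb` table above.

```r
# Hypothetical helper: clean a select() result and convert it to gene sets
clean_go_table = function(tb, ontology = "BP") {
    # drop genes without any GO annotation
    tb = tb[!is.na(tb$GOALL), , drop = FALSE]
    # keep a single ontology namespace
    tb = tb[tb$ONTOLOGYALL %in% ontology, , drop = FALSE]
    # remove repeated gene/term pairs
    tb = unique(tb[, c("ENTREZID", "GOALL")])
    # one vector of gene IDs per GO term
    split(tb$ENTREZID, tb$GOALL)
}

# Toy table with the same columns as `tb` (values are illustrative)
toy = data.frame(
    ENTREZID = c("1", "1", "2", "3", "2"),
    GOALL = c("GO:0007165", "GO:0007165", "GO:0007165", NA, "GO:0005575"),
    ONTOLOGYALL = c("BP", "BP", "BP", NA, "CC"),
    stringsAsFactors = FALSE
)
gs = clean_go_table(toy)
gs
```

On the real data, `clean_go_table(tb)` would return a named list with one element per GO term.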
With an OrgDb object, you can also use the helper
function load_go_genesets() in the GSEAtopics
package:
library(GSEAtopics)
lt = load_go_genesets(org_db, "BP")
Practice
Practice 1
Try to obtain GO (BP) gene sets for dolphin (Latin name: Tursiops truncatus) from AnnotationHub, and KEGG pathways for giant panda (Latin name: Ailuropoda melanoleuca).
qu = query(ah, c("Tursiops truncatus", "OrgDb"))
orgdb = ah[[ qu$ah_id ]]
gs = load_go_genesets(orgdb, "BP")
lt = load_kegg_genesets("aml")