Topic 1-06: GO/KEGG gene sets for other organisms

Besides those well-studied organisms, there are also resources for less well-studied organisms, mainly KEGG and GO gene sets.

KEGG pathways

KEGG provides pathways for a large number of organisms. You only need the corresponding organism code:

read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/ptr"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/pps"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/ggo"), sep = "\t")
read.table(url("https://rest.kegg.jp/list/pathway/pon"), sep = "\t")
...
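
If the organism code is not known, it can be looked up by name. The sketch below uses a small hand-written table; the column names `code` and `name` are chosen here for illustration, and `find_kegg_code()` is a hypothetical helper, not part of any package. The complete list of organisms is assumed to be available from https://rest.kegg.jp/list/organism.

```r
# Hand-written excerpt of KEGG organism codes (illustrative, not the full list)
org = data.frame(
    code = c("hsa", "ptr", "pps", "ggo", "pon", "aml"),
    name = c("Homo sapiens (human)",
             "Pan troglodytes (chimpanzee)",
             "Pan paniscus (bonobo)",
             "Gorilla gorilla gorilla (western lowland gorilla)",
             "Pongo abelii (Sumatran orangutan)",
             "Ailuropoda melanoleuca (giant panda)")
)

# Hypothetical helper: case-insensitive search in the organism names
find_kegg_code = function(org, pattern) {
    org[grepl(pattern, org$name, ignore.case = TRUE), ]
}

find_kegg_code(org, "panda")
```

Searching by the Latin name (e.g. "Ailuropoda") works the same way and is usually less ambiguous than a common name.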

Or use the helper function from GSEAtopics:

load_kegg_genesets(organism)
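
The internals of load_kegg_genesets() are not shown here. As a rough sketch of how KEGG gene sets could be assembled by hand, the example below assumes that the KEGG REST `link` operation (https://rest.kegg.jp/link/pathway/hsa for human) returns two tab-separated columns, gene and pathway, each with a database prefix. The gene IDs in the toy input are made up.

```r
# Toy lines in the assumed gene-to-pathway format (gene IDs are made up)
lines = c(
    "hsa:111\tpath:hsa00010",
    "hsa:222\tpath:hsa00010",
    "hsa:222\tpath:hsa00020",
    "hsa:333\tpath:hsa00020"
)

link = read.table(text = lines, sep = "\t",
    col.names = c("gene", "pathway"), colClasses = "character")

# Remove the database prefixes ("hsa:" and "path:")
link$gene = sub("^[a-z]+:", "", link$gene)
link$pathway = sub("^path:", "", link$pathway)

# One character vector of genes per pathway
kegg_sets = split(link$gene, link$pathway)
kegg_sets
```

With real data, `text = lines` would be replaced by `url(...)` as in the read.table() calls above.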

OrgDb packages

The Bioconductor core team maintains org.*.db packages for 18 organisms.

Package	Organism	Package	Organism
`org.Hs.eg.db`	Human	`org.Ss.eg.db`	Pig
`org.Mm.eg.db`	Mouse	`org.Gg.eg.db`	Chicken
`org.Rn.eg.db`	Rat	`org.Mmu.eg.db`	Rhesus monkey
`org.Dm.eg.db`	Fruit fly	`org.Cf.eg.db`	Canine
`org.At.tair.db`	Arabidopsis	`org.EcK12.eg.db`	E. coli strain K12
`org.Sc.sgd.db`	Yeast	`org.Xl.eg.db`	African clawed frog
`org.Dr.eg.db`	Zebrafish	`org.Ag.eg.db`	Malaria mosquito
`org.Ce.eg.db`	Nematode	`org.Pt.eg.db`	Chimpanzee
`org.Bt.eg.db`	Bovine	`org.EcSakai.eg.db`	E. coli strain Sakai

These org.*.db packages can be used in the same way as the org.Hs.eg.db package shown earlier.

select(org.*.db, keys = keys(org.*.db), columns = c("GOALL", "ONTOLOGYALL"))
select(org.*.db, keys = "BP", keytype = "ONTOLOGYALL", columns = c("ENTREZID", "GOALL"))
as.list(org.*.egGO2ALLEGS)

Note that if the primary ID type is not Entrez ID (as in org.At.tair.db and org.Sc.sgd.db), the name of the object may be different.

AnnotationHub

AnnotationHub provides OrgDb objects for a large number of other organisms (~2000), which can be used to obtain GO gene sets and mappings between gene IDs.

library(AnnotationHub)
ah = AnnotationHub()

When searching for an organism, its Latin name is recommended, because a common name can match many unrelated records. Also add the "OrgDb" (data class) keyword:

query(ah, c("cat", "OrgDb"))

## AnnotationHub with 27 records
## # snapshotDate(): 2024-10-28
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Viverra suricatta, Vermicularia truncata, Tursiops truncatus, Tr...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH117430"]]' 
## 
##              title                               
##   AH117430 | org.Catostomus_texanus.eg.sqlite    
##   AH117534 | org.Medicago_truncatula.eg.sqlite   
##   AH117537 | org.Felis_catus.eg.sqlite           
##   AH117539 | org.Felis_silvestris_catus.eg.sqlite
##   AH117661 | org.Suricata_suricatta.eg.sqlite    
##   ...        ...                                 
##   AH118997 | org.Vermicularia_truncata.eg.sqlite 
##   AH119133 | org.Cuculus_indicator.eg.sqlite     
##   AH119134 | org.Indicator_indicator.eg.sqlite   
##   AH119157 | org.Truncatella_angustata.eg.sqlite 
##   AH119158 | org.Truncatella_truncata.eg.sqlite

query(ah, c("Felis catus", "OrgDb"))

## AnnotationHub with 1 record
## # snapshotDate(): 2024-10-28
## # names(): AH117537
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Felis catus
## # $rdataclass: OrgDb
## # $rdatadateadded: 2024-10-04
## # $title: org.Felis_catus.eg.sqlite
## # $description: NCBI gene ID based annotations about Felis catus
## # $taxonomyid: 9685
## # $genome: NCBI genomes
## # $sourcetype: NCBI/UniProt
## # $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/p...
## # $sourcesize: NA
## # $tags: c("NCBI", "Gene", "Annotation") 
## # retrieve record with 'object[["AH117537"]]'

Besides using query() to search AnnotationHub, you can also use the BiocHubsShiny package to search for datasets interactively, or the ah_shiny() function from GSEAtopics.

Now we download the data.

# Caution: the record ID can change between package versions
org_db = ah[["AH117537"]]  # using `[[` downloads the dataset
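
Because the record ID is not stable, one option is to look the record up by species name each time instead of hard-coding the ID. The function below is a minimal sketch: `get_orgdb()` is a hypothetical helper (not part of AnnotationHub or GSEAtopics), and it assumes that exactly one OrgDb record exists for the given Latin name.

```r
# Hypothetical helper: fetch the OrgDb record for one species by its Latin name
get_orgdb = function(hub, species) {
    qu = AnnotationHub::query(hub, c(species, "OrgDb"))
    # query() matches keywords loosely, so keep only exact species matches
    qu = qu[which(qu$species == species)]
    if (length(qu) != 1) {
        stop("Expected exactly one OrgDb record for '", species,
             "', found ", length(qu), ".")
    }
    qu[[1]]  # `[[` downloads the dataset
}

# Usage (requires an internet connection):
# ah = AnnotationHub::AnnotationHub()
# org_db = get_orgdb(ah, "Felis catus")
```

Stopping on zero or several matches is a deliberate choice here, so that an ambiguous name does not silently select the wrong organism.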

org_db is an OrgDb object but contains less information than the org.*.db packages:

org_db

## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi

columns(org_db)

##  [1] "ACCNUM"      "ALIAS"       "CHR"         "ENSEMBL"     "ENTREZID"   
##  [6] "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"         "GO"         
## [11] "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"     
## [16] "SYMBOL"

You can obtain the GO gene sets manually with select(), using "ENTREZID" as the key type and requesting the "GOALL" and "ONTOLOGYALL" columns:

all_genes = keys(org_db, keytype = "ENTREZID")
tb = select(org_db, keys = all_genes, keytype = "ENTREZID", 
    columns = c("GOALL", "ONTOLOGYALL"))
head(tb)

##    ENTREZID      GOALL ONTOLOGYALL
## 1 100037403 GO:0007165          BP
## 2 100037403 GO:0007268          BP
## 3 100037403 GO:0034220          BP
## 4 100037403 GO:0042391          BP
## 5 100037403 GO:0050877          BP
## 6 100037403 GO:0007154          BP

You may need to clean the mapping table:

tb = tb[!is.na(tb$GOALL), ]
tb = unique(tb)
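
The cleaned table can then be turned into a list of gene sets, one vector of genes per GO term. The following sketch uses a small made-up table with the same three columns as `tb` (the gene IDs "g1" to "g3" are placeholders) so that it runs without downloading anything.

```r
# Toy mapping table with the same columns as `tb` (gene IDs are placeholders)
toy = data.frame(
    ENTREZID    = c("g1", "g1", "g2", "g2", "g3", "g3"),
    GOALL       = c("GO:0007165", "GO:0007165", "GO:0007165",
                    "GO:0005634", NA, "GO:0005634"),
    ONTOLOGYALL = c("BP", "BP", "BP", "CC", NA, "CC")
)

# Same cleaning as above: drop unannotated genes and duplicated rows
toy = toy[!is.na(toy$GOALL), ]
toy = unique(toy)

# Keep one ontology, then collect the genes of each GO term
bp = toy[toy$ONTOLOGYALL == "BP", ]
gs_bp = split(bp$ENTREZID, bp$GOALL)
gs_bp
```

Because "GOALL" already includes annotations inherited from offspring terms, no further propagation along the GO hierarchy is needed after the split.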

With an OrgDb object, you can also use the helper function load_go_genesets() from the GSEAtopics package:

library(GSEAtopics)
lt = load_go_genesets(org_db, "BP")

Practice

Practice 1

Try to obtain GO (BP) gene sets for dolphin (Latin name: Tursiops truncatus) from AnnotationHub, and KEGG pathways for giant panda (Latin name: Ailuropoda melanoleuca).

qu = query(ah, c("Tursiops truncatus", "OrgDb"))
orgdb = ah[[ qu$ah_id ]]
gs = load_go_genesets(orgdb, "BP")

lt = load_kegg_genesets("aml")

Zuguang Gu z.gu@dkfz.de

2025-05-31
