The Bioconductor ecosystem already provides gene set resources for a large number of organisms. However, a poorly studied organism may still lack a well-annotated gene set resource. Human is the most studied organism and has the richest annotation resources. For other organisms, we can build similar gene sets by mapping human genes to their orthologues.

The hallmark gene sets are a useful resource for exploring biological functions. However, MSigDB natively provides them only for human and mouse, leaving a large number of other organisms uncovered. In the rest of this section, I will demonstrate how to construct hallmark gene sets for the giant panda.

The Orthology.eg.db package

Let me first demonstrate how to obtain the mapping from human genes to panda genes. In Bioconductor, there is a standard package Orthology.eg.db which provides orthologue mappings for hundreds of organisms. Let’s first load the package.

library(Orthology.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.3.2
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## 

As with other Bioconductor annotation packages, there is a database object Orthology.eg.db which has the same name as the package.

Orthology.eg.db
## OrthologyDb object:
## | Db type: OrthologyDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ORTHOLOGY_DB
## | EGSOURCEDATE: 2023-Mar05
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | DBSCHEMAVERSION: 2.1
## 
## Please see: help('select') for usage information

The orthology database can be thought of as a table where columns are different organisms and rows are groups of orthologous genes. All supported organisms can be obtained with the function columns().

cl = columns(Orthology.eg.db)
length(cl)
## [1] 508
head(cl)
## [1] "Acanthisitta.chloris"        "Acanthochromis.polyacanthus"
## [3] "Acanthopagrus.latus"         "Accipiter.gentilis"         
## [5] "Acinonyx.jubatus"            "Acipenser.ruthenus"

The column names are the organisms’ Latin names, with the words separated by a dot. To use the select() interface, we also need to check which columns can be used as keys. In Orthology.eg.db, all columns can be used as key columns, which means we can map between any two organisms. We can validate this by:

kt = keytypes(Orthology.eg.db)
identical(cl, kt)
## [1] TRUE

Next, we check whether the panda is supported in Orthology.eg.db by searching for its Latin name "Ailuropoda melanoleuca".

kt[grep("Ailuropoda", kt, ignore.case = TRUE)]
## [1] "Ailuropoda.melanoleuca"
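
Because the column names follow a simple pattern, this lookup can also be scripted. The sketch below is only an illustration: latin_to_column() is a hypothetical helper, not part of Orthology.eg.db, and supported is a small hand-written stand-in for keytypes(Orthology.eg.db) built from names that appear in this section.

```r
# Hypothetical helper: convert "Genus species" to "Genus.species" and
# return it only if it is among the supported column names.
latin_to_column = function(latin_name, supported) {
    col = gsub("\\s+", ".", trimws(latin_name))
    if (col %in% supported) col else NA_character_
}

# Stand-in for keytypes(Orthology.eg.db); an assumption for this example.
supported = c("Acinonyx.jubatus", "Ailuropoda.melanoleuca", "Homo.sapiens")

latin_to_column("Ailuropoda melanoleuca", supported)
latin_to_column("Tursiops truncatus", supported)   # not in the toy vector
```

With the real package, supported would simply be replaced by kt.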

Yes, it is supported. For converting between two types of IDs, I will first create a global mapping vector, where human genes are the names of the vector and panda genes are the values.

I first extract all the human genes, which will be the primary keys for the orthology mapping. Remember that the value of the keytype argument must be one of the values in keytypes(Orthology.eg.db).

keys = keys(Orthology.eg.db, keytype = "Homo.sapiens")
head(keys)
## [1]  1  2  9 10 12 13

The keys are Entrez IDs.

Now we can apply the select() function to the Orthology.eg.db database object to generate the mapping.

map_df = select(Orthology.eg.db, keys, 
    columns = "Ailuropoda.melanoleuca", keytype = "Homo.sapiens")
head(map_df)
##   Homo.sapiens Ailuropoda.melanoleuca
## 1            1              100482878
## 2            2                     NA
## 3            9                     NA
## 4           10                     NA
## 5           12              100481925
## 6           13              100468737

Now map_df is a two-column data frame where the first column contains the human genes and the second column contains the panda genes. If there is no orthologue mapping, the corresponding value for panda is NA. The gene IDs are integers, but what exactly is their ID type? According to the documentation of the Orthology.eg.db package, the internal IDs are Entrez IDs.

Danger zone: Entrez IDs should not be saved as integers!

Before we move on, we have to double check how Entrez IDs are saved in map_df.

class(map_df[, 1])
## [1] "integer"
class(map_df[, 2])
## [1] "integer"

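Unfortunately, they are stored as integers, so we have to convert them to characters. Otherwise, when the IDs are later used as indices, R would treat them as positions rather than as names.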

map_df[, 1] = as.character(map_df[, 1])
map_df[, 2] = as.character(map_df[, 2])
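
To see why integer IDs are risky, consider what happens when they are used as indices. The toy vector below is made up for illustration (its names and values reuse IDs from the output above): an integer index selects by position, while a character index selects by name.

```r
# Toy lookup vector: names are human Entrez IDs, values are panda Entrez IDs.
# The order is deliberately shuffled so position and name disagree.
toy = c("12" = "100481925", "13" = "100468737", "1" = "100482878")

toy[1L]     # integer index: the FIRST element, i.e. the entry named "12"
toy["1"]    # character index: the entry NAMED "1"

# as.character() keeps missing values missing, so is.na() still works later.
is.na(as.character(c(1L, NA)))
```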

Now we can safely construct the mapping vector where human genes are names and panda genes are values.

map_vec = structure(map_df[, 2], names = map_df[, 1])
# or we can do it in two lines
map_vec = map_df[, 2]
names(map_vec) = map_df[, 1]

head(map_vec)
##           1           2           9          10          12          13 
## "100482878"          NA          NA          NA "100481925" "100468737"

Now that we have the mapping from human to panda, we can construct the hallmark gene sets for panda. First, let’s obtain the hallmark gene sets for human. The get_msigdb() function is from the GSEAtraining package.

library(GSEAtraining)
gs_human = get_msigdb(version = "2023.2.Hs", collection = "h.all")

Next we perform the conversion by using the human genes as character indices. Note that if a human gene is not mapped to a panda gene, the corresponding value will be NA, which we need to remove. Using the unname() function is optional: x2 is constructed from map_vec, which has names attached, and here I removed the names attribute of x2 to make gs_panda easier to read.

gs_panda = lapply(gs_human, function(x) {
    x2 = map_vec[x]
    unname(x2[!is.na(x2)])
})
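
The behavior of character indexing can be checked on a small made-up example. Here toy_map and toy_sets are illustrative stand-ins for map_vec and gs_human (the set names are invented; the IDs reuse values from the output above). Both an explicit NA mapping and a key that is absent from the vector come back as NA, so a single is.na() filter handles both cases.

```r
toy_map = c("1" = "100482878", "2" = NA, "12" = "100481925", "13" = "100468737")
toy_sets = list(
    set_a = c("1", "2", "12"),   # "2" has no panda orthologue
    set_b = c("13", "999"),      # "999" is not a name in toy_map
    set_c = c("2")               # nothing maps: becomes an empty set
)

toy_panda = lapply(toy_sets, function(x) {
    x2 = toy_map[x]              # look up by name, NA where unmapped
    unname(x2[!is.na(x2)])
})
toy_panda
```

Note that set_c ends up with length zero, which is why empty gene sets have to be dealt with afterwards.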

Finally, we remove empty gene sets. Now gs_panda contains the hallmark gene sets for panda.

gs_panda = gs_panda[sapply(gs_panda, length) > 0]
gs_panda[1:2]
## $HALLMARK_ADIPOGENESIS
##   [1] "100483394" "100467221" "100476344" "100467341" "100480578" "100483719"
##   [7] "100467494" "100479955" "100471606" "100476922" "109488363" "100481319"
##  [13] "100473600" "100476840" "100471854" "100475159" "100483231" "100478103"
##  [19] "100464955" "100475013" "100470064" "100477661" "100475955" "100478341"
##  [25] "100467218" "100484596" "100478407" "100473683" "100484116" "100465691"
##  [31] "100480054" "100472549" "100473949" "100470190" "100472500" "100481463"
##  [37] "100483649" "100463982" "100470950" "100479936" "100464675" "100483793"
##  [43] "100482840" "100471225" "100475944" "100478752" "100467925" "100478122"
##  [49] "100471241" "100479884" "100475526" "100475627" "100465139" "100464724"
##  [55] "100463800" "100480854" "100473628" "100479670" "100476910" "100481266"
##  [61] "100479095" "100467495" "100471989" "100477433" "100479347" "100472325"
##  [67] "100474044" "100478586" "100473571" "100471547" "100467523" "100477228"
##  [73] "100465578" "100464202" "100483572" "100473687" "100469740" "100475353"
##  [79] "100470081" "100468497" "100470379" "100468331" "100477654" "100479358"
##  [85] "100471171" "100468728" "100483093" "100482626" "100471953" "100473120"
##  [91] "100484549" "100466232" "100480395" "100463910" "100480163" "100482520"
##  [97] "100483295" "100473768" "100470553" "100475280" "100468697" "100478658"
## [103] "100470630" "100472419" "100474605" "100463863" "100479605" "100472770"
## [109] "100482804" "105238479" "100484030" "100464608" "100464508" "100465272"
## [115] "100483500" "100475941" "100472911" "100477762" "100474896" "100478484"
## [121] "100481263" "100472251" "100469925" "100477501" "100477724" "100480436"
## [127] "100465245" "105241326" "100469480" "100471478" "100479273" "100464074"
## [133] "100479861" "100475139" "100483714" "100467763" "100478217" "100474540"
## [139] "100465011" "100482938" "100467132" "100465583" "100478555" "100466891"
## [145] "100470302" "100480537" "100482490" "100465465" "100471603" "100466261"
## [151] "105238785" "100476977" "100479709" "100483477" "100472435" "100479054"
## [157] "100476123" "100473802" "100484067" "100469567" "100473290" "100480530"
## [163] "100464327" "100471124" "100467163" "100471492" "100475614" "100466386"
## [169] "100465252" "100473253" "100477509" "100470895" "100465294" "100476369"
## [175] "100466768" "100479654" "100464478" "100480186" "100484661" "100484543"
## [181] "100476136" "100484798" "100468480" "100472838" "100477702" "100473348"
## [187] "100484841" "100479434" "100478686" "100483229" "100465735" "100464004"
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1] "100474675" "100477383" "100465106" "100482126" "100472026" "100465793"
##   [7] "100475168" "100470489" "100465389" "100473529" "100478429" "100480891"
##  [13] "100474163" "100477480" "100466311" "100481549" "100470232" "100463753"
##  [19] "100471517" "100479467" "100484065" "100478428" "100484298" "100477902"
##  [25] "100484718" "100482462" "100484468" "100482689" "100476345" "100467917"
##  [31] "100480521" "100475502" "100475218" "100471626" "100480227" "100483382"
##  [37] "100483271" "100484695" "100472499" "100481747" "100467898" "100467411"
##  [43] "100472520" "100483259" "105234669" "100469185" "100475682" "100483900"
##  [49] "100478672" "100484183" "105235090" "100478049" "100467045" "100467465"
##  [55] "100465360" "100473811" "100472504" "100464489" "100469300" "100472871"
##  [61] "105240499" "100473767" "100468235" "100467541" "100467791" "100464552"
##  [67] "100466565" "100481771" "100475920" "100481622" "100476014" "100479773"
##  [73] "100468902" "100474777" "100464636" "100477284" "100464393" "100478290"
##  [79] "105241843" "100476475" "100480624" "100482679" "100462659" "100473768"
##  [85] "100481672" "100463682" "100471679" "100481671" "100478943" "100470471"
##  [91] "100477772" "100474774" "100474520" "100484790" "100467740" "100484109"
##  [97] "100476941" "100480874" "100480862" "100475725" "100466247" "100478292"
## [103] "100468126" "100482473" "100476095" "100481194" "105237562" "100463717"
## [109] "100484001" "100481889" "100474086" "100469809" "100464009" "100484264"
## [115] "100475688" "100481360" "105234734" "100471453" "100477981" "100463836"
## [121] "100483416" "100467628" "100479649" "100479606" "100466506" "100470792"
## [127] "100482608" "100475429" "100482185" "100476115" "100468228" "100471786"
## [133] "100476162" "100468792" "100470060" "100464343" "100473675" "100478007"
## [139] "100484719" "100468826" "100473539" "100474611" "100479579" "100480506"
## [145] "100478926" "100479482" "100466931" "100471128" "100484671" "100464779"
## [151] "100476391" "100480183" "100469980" "100476314" "100476695" "100481666"
## [157] "100484430" "100474026" "100471596" "100464886" "100478040" "100484836"
## [163] "100470703" "100463584" "100465131" "100476205" "100472707" "100465129"
## [169] "100468855" "100472456" "100471563" "100473222" "100474087" "100467951"
## [175] "100479205" "100476065" "100472174" "100479480" "100473238"
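
After a conversion like this, it can be useful to check how much of each gene set survived. The sketch below is one possible way to do that: mapping_coverage() is a hypothetical helper, and the two toy lists stand in for gs_human and gs_panda.

```r
# Hypothetical helper: fraction of genes retained in each converted set.
# Sets missing from `converted` (e.g. removed because empty) count as 0.
mapping_coverage = function(original, converted) {
    n_orig = lengths(original)
    n_conv = lengths(converted)[names(original)]
    n_conv[is.na(n_conv)] = 0
    structure(as.numeric(n_conv) / n_orig, names = names(original))
}

# Toy data: set_b lost all its genes and was dropped from the converted list.
toy_human = list(set_a = c("1", "2", "12", "13"), set_b = c("9", "10"))
toy_panda = list(set_a = c("100482878", "100481925", "100468737"))

mapping_coverage(toy_human, toy_panda)
```

With the real objects, the call would be mapping_coverage(gs_human, gs_panda); a low fraction suggests that the converted set should be interpreted with care.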

Practice

Practice 1

Construct hallmark gene sets for the dolphin (Latin name: Tursiops truncatus).

Solution

First check whether it is supported in Orthology.eg.db.

kt[grep("Tursiops", kt, ignore.case = TRUE)]
## [1] "Tursiops.truncatus"
map_df = select(Orthology.eg.db, keys, 
    columns = "Tursiops.truncatus", keytype = "Homo.sapiens")
map_df[, 1] = as.character(map_df[, 1])
map_df[, 2] = as.character(map_df[, 2])

map_vec = structure(map_df[, 2], names = map_df[, 1])

gs_dolphin = lapply(gs_human, function(x) {
    x2 = map_vec[x]
    unname(x2[!is.na(x2)])
})
gs_dolphin = gs_dolphin[sapply(gs_dolphin, length) > 0]
gs_dolphin[1:2]
## $HALLMARK_ADIPOGENESIS
##   [1] "101324261" "101331892" "101333071" "101321535" "101339921" "117307625"
##   [7] "101333508" "101338166" "101338715" "101337662" "101337434" "101324186"
##  [13] "101339119" "101315601" "101333510" "101324917" "101318798" "117308060"
##  [19] "101331188" "117311769" "101336246" "117308978" "101334408" "117313583"
##  [25] "101337204" "101322713" "101331966" "101320909" "117309210" "101319712"
##  [31] "101317605" "101338084" "101331217" "101333092" "101318581" "101336723"
##  [37] "101331721" "101336369" "101321426" "101320458" "101327681" "101328124"
##  [43] "101329295" "101322991" "101323341" "101320029" "101333781" "101325542"
##  [49] "101332916" "101318829" "101338610" "101335861" "101331598" "101326966"
##  [55] "117308749" "101325838" "101321389" "101332530" "117308690" "101332074"
##  [61] "101323928" "101330785" "101334301" "101321942" "101335654" "101338928"
##  [67] "101335930" "101334607" "101321860" "101334218" "101327108" "101317181"
##  [73] "101331943" "101330375" "101321280" "101318459" "117311150" "101319836"
##  [79] "101338608" "101324500" "101340172" "101323442" "101337802" "101339046"
##  [85] "101330605" "101335592" "101338890" "117311850" "101338602" "101326990"
##  [91] "101316447" "101334850" "101328969" "101321334" "101324223" "101327594"
##  [97] "101327173" "117314225" "101332137" "101332387" "101326295" "101331166"
## [103] "117313539" "101335303" "117309227" "101324084" "101337343" "101329342"
## [109] "101318463" "101337284" "101336162" "101326342" "101337956" "101336374"
## [115] "101332376" "101329906" "101321797" "101329039" "101322039" "101319682"
## [121] "101321364" "101329593" "101330807" "117311742" "101319163" "101318892"
## [127] "101328287" "101318749" "101322527" "101319551" "117309367" "101332305"
## [133] "101325875" "101339425" "101337346" "101331169" "101324967" "117314139"
## [139] "101327440" "101332835" "101326018" "101322234" "101334941" "101316073"
## [145] "101326873" "101329793" "101335522" "101332488" "101333678" "117311841"
## [151] "117311779" "101333929" "101337766" "101333557" "101335961" "101316913"
## [157] "101337558" "101323493" "101327629" "101332633" "101315682" "101329811"
## [163] "101336956" "101329389" "101333152" "101327260" "101331047" "101336065"
## [169] "101322982" "101337059" "101337718" "101338733" "101331474" "101333013"
## [175] "101320464" "101332754" "101316418" "101330490" "101327745" "101337563"
## [181] "101318801" "101331812" "101328440" "101326471" "101316129" "101333717"
## [187] "101322601" "101318867" "117311820" "101328687" "101316570" "109549330"
## [193] "117308030"
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1] "101316006" "101322461" "101325827" "117308055" "101337652" "101334655"
##   [7] "101320863" "117311091" "101338911" "101335912" "101332324" "101319443"
##  [13] "117313830" "117307933" "101332498" "117312582" "101323189" "101337792"
##  [19] "109552529" "101332929" "101318861" "101337386" "101324372" "101318685"
##  [25] "101320520" "101318977" "101319269" "101319080" "101330521" "101337561"
##  [31] "101338006" "101334285" "101322889" "117308993" "101316111" "101328068"
##  [37] "101325833" "109551019" "101319896" "117312596" "117310434" "101328686"
##  [43] "101325006" "101318184" "101338838" "101333230" "101331685" "101329574"
##  [49] "101317829" "101338156" "101330476" "101325906" "101324454" "101325246"
##  [55] "101321725" "101337526" "101317832" "101324567" "101316077" "101317421"
##  [61] "101315588" "101321590" "101336517" "101336680" "101316660" "101336991"
##  [67] "101315855" "101317113" "101337909" "101338980" "101324613" "101339595"
##  [73] "101324632" "101335518" "101335803" "117313882" "101336672" "101336089"
##  [79] "117311758" "101315573" "101330507" "101337436" "101327594" "101336209"
##  [85] "101330540" "101327715" "101330184" "117309240" "101325702" "101338510"
##  [91] "101339814" "101338603" "101319832" "101333811" "101334331" "101331381"
##  [97] "101318809" "101325136" "101324556" "101323898" "101330858" "117310158"
## [103] "101338293" "101316092" "101329901" "101322100" "101331699" "101316054"
## [109] "101318917" "101335564" "117313261" "101337877" "101321073" "101324233"
## [115] "101323565" "101334791" "117314288" "101333191" "101317602" "101318472"
## [121] "117313815" "109551474" "101339042" "101319418" "101320162" "101321493"
## [127] "101329372" "101330701" "101316407" "101340051" "101338757" "101327068"
## [133] "101318929" "101338302" "101338542" "101335049" "101331937" "117308498"
## [139] "101318466" "101328304" "117309120" "101323322" "101334480" "101322158"
## [145] "101332255" "101329453" "117308176" "101336040" "117308963" "101323062"
## [151] "101339204" "101329800" "101328107" "101331087" "101321759" "101329195"
## [157] "101326362" "101326807" "101322203" "101328796" "101335034" "101325082"
## [163] "101327220" "101327029" "101315499" "117313367" "101334886" "101319774"
## [169] "101338049" "101339193" "101320069" "101318178" "101334069" "101319749"
## [175] "101334010" "101319013" "101326041" "101329104" "101326409" "101320092"
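
Since the panda and dolphin conversions share the same steps, they could be wrapped in one function. The following is a minimal sketch rather than a function from any package: convert_gene_sets() and its arguments are hypothetical names, and it assumes map_df is a two-column data frame like the one returned by select() above (source IDs first, target IDs second).

```r
convert_gene_sets = function(gene_sets, map_df) {
    # Store IDs as characters so they are used as names, not positions.
    map_vec = structure(as.character(map_df[, 2]),
                        names = as.character(map_df[, 1]))
    converted = lapply(gene_sets, function(x) {
        x2 = map_vec[as.character(x)]
        unname(x2[!is.na(x2)])
    })
    # Drop gene sets that ended up empty.
    converted[lengths(converted) > 0]
}

# Toy input; the first column mimics human IDs, the second panda IDs.
toy_df = data.frame(
    Homo.sapiens = c(1L, 2L, 12L, 13L),
    Ailuropoda.melanoleuca = c(100482878L, NA, 100481925L, 100468737L)
)
toy_sets = list(set_a = c("1", "2", "12"), set_b = "2")
convert_gene_sets(toy_sets, toy_df)
```

One caveat of this design: if a source ID appears more than once in the first column, name-based indexing returns only the first match, so any additional orthologues for that gene would be ignored.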