Topic 1-07: Generate gene sets for other organisms by mapping to orthologues
Zuguang Gu z.gu@dkfz.de
2025-05-31
Source: vignettes/topic1_07_ortholog.Rmd
In the Bioconductor ecosystem, there are already many gene set resources covering a large number of organisms. However, a poorly studied organism may still lack a well-annotated gene set resource. Human is the most studied organism and has the richest annotation resources. For other organisms, we can build similar gene sets by mapping human genes to their orthologues.
The hallmark gene sets are a useful resource for exploring biological functions. However, MSigDB natively provides them only for human and mouse, so a large number of other organisms are not covered. In the rest of this section, I will demonstrate how to construct hallmark gene sets for the giant panda.
The Orthology.eg.db package
Let me first demonstrate how to obtain the mappings from human genes to panda genes. On Bioconductor, there is a standard package Orthology.eg.db which provides orthologue mappings for hundreds of organisms. Let’s first load the package.
library(Orthology.eg.db)
Similarly, there is a database object Orthology.eg.db which has the same name as the package.
Orthology.eg.db
## OrthologyDb object:
## | Db type: OrthologyDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ORTHOLOGY_DB
## | EGSOURCEDATE: 2023-Mar05
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | DBSCHEMAVERSION: 2.1
The orthology database can be thought of as a table where columns are different organisms and rows are groups of orthologous genes. All supported organisms can be obtained with the function columns().
cl = columns(Orthology.eg.db)
length(cl)
## [1] 508
head(cl)
## [1] "Acanthisitta.chloris" "Acanthochromis.polyacanthus"
## [3] "Acanthopagrus.latus" "Accipiter.gentilis"
## [5] "Acinonyx.jubatus" "Acipenser.ruthenus"
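The table picture can be made concrete with a small base R sketch. The data frame below is only an illustration, not how Orthology.eg.db stores its data: the object name toy_orth is hypothetical, and the four rows are taken from the human-to-panda mappings printed later in this section (no panda orthologue is listed for human gene 2). A real database would have one column per supported organism.

```r
# Toy illustration only: a two-organism slice of the orthology "table".
# The IDs are copied from the head(map_df) output shown in this document.
toy_orth = data.frame(
    Homo.sapiens           = c("1", "2", "12", "13"),
    Ailuropoda.melanoleuca = c("100482878", NA, "100481925", "100468737"),
    stringsAsFactors = FALSE
)

# Every column is an organism, so any column can serve as the key column.
colnames(toy_orth)

# Map in either direction by choosing which column to look up.
human_to_panda = toy_orth$Ailuropoda.melanoleuca[match("12", toy_orth$Homo.sapiens)]
panda_to_human = toy_orth$Homo.sapiens[match("100468737", toy_orth$Ailuropoda.melanoleuca)]
human_to_panda
panda_to_human
```

Each row is one orthologue group, so looking up a gene in one column and reading another column on the same row is all that a mapping between two organisms amounts to.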
The column names are the organisms’ Latin names, with genus and species separated by a dot. Remember that to use the select() interface, we also need to check which columns can be used as keys. In Orthology.eg.db, all columns can be used as key columns, which means we can map between any two organisms. We can validate it by:
kt = keytypes(Orthology.eg.db)
identical(cl, kt)
## [1] TRUE
We check whether the panda organism is supported in
Orthology.eg.db by searching its Latin name
"Ailuropoda melanoleuca".
kt[grep("Ailuropoda", kt, ignore.case = TRUE)]
## [1] "Ailuropoda.melanoleuca"
Yes, it is supported. For converting between two types of IDs, I will first create a global mapping vector, where human genes are the names of the vector and panda genes are the values.
I first extract all the human genes, which will be the primary keys for the orthology mapping. Remember that the value of the keytype argument should be a valid value in keytypes(Orthology.eg.db).
keys = keys(Orthology.eg.db, keytype = "Homo.sapiens")
head(keys)
## [1] 1 2 9 10 12 13
length(keys)
## [1] 19561
The keys are Entrez IDs.
Now we can apply the select() function to the Orthology.eg.db database object and generate the mapping.
map_df = select(Orthology.eg.db, keys,
    columns = "Ailuropoda.melanoleuca", keytype = "Homo.sapiens")
head(map_df)
## Homo.sapiens Ailuropoda.melanoleuca
## 1 1 100482878
## 2 2 NA
## 3 9 NA
## 4 10 NA
## 5 12 100481925
## 6 13 100468737
Now map_df is a two-column data frame where the first column contains the human genes and the second column contains the panda genes. If there is no orthologue mapping, the corresponding value for panda is NA. We see that the gene IDs are integers, but what exactly is their ID type? The documentation of the Orthology.eg.db package tells us that the internal IDs are Entrez IDs.
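Since unmapped genes appear as NA, a quick way to see how much of the human gene list carries over is to count the non-missing values. The sketch below rebuilds only the six rows printed by head(map_df) so that it runs without the database; the name toy_map is hypothetical, and the same two expressions should apply unchanged to the second column of the real map_df.

```r
# The six rows printed by head(map_df), rebuilt by hand so the example
# is self-contained. This is NOT the full mapping table.
toy_map = data.frame(
    Homo.sapiens           = c(1L, 2L, 9L, 10L, 12L, 13L),
    Ailuropoda.melanoleuca = c(100482878L, NA, NA, NA, 100481925L, 100468737L)
)

# Number and fraction of human genes with a panda orthologue in this slice.
n_mapped = sum(!is.na(toy_map$Ailuropoda.melanoleuca))
frac_mapped = mean(!is.na(toy_map$Ailuropoda.melanoleuca))
n_mapped
frac_mapped
```

The fraction from these six rows says nothing about the whole table; it only shows the calculation.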
Danger zone: Entrez IDs should not be saved as integers!
Before we move on, we have to double-check how the Entrez IDs are stored in map_df.
class(map_df[, 1])
## [1] "integer"
class(map_df[, 2])
## [1] "integer"
Unfortunately they are stored as integers, so we have to convert them to characters.
map_df[, 1] = as.character(map_df[, 1])
map_df[, 2] = as.character(map_df[, 2])
Now we can safely construct the mapping vector where human genes are names and panda genes are values.
map_vec = structure(map_df[, 2], names = map_df[, 1])
# or we can do it in two lines
map_vec = map_df[, 2]
names(map_vec) = map_df[, 1]
head(map_vec)
## 1 2 9 10 12 13
## "100482878" NA NA NA "100481925" "100468737"
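A plausible reason for the warning in this danger zone, which the text does not spell out, is how R indexes vectors: an integer index selects by position, whereas a character index selects by name. The minimal sketch below uses three of the mappings shown above; toy_vec and the other object names are hypothetical.

```r
# Three human-to-panda mappings from this document; names are human Entrez IDs.
toy_vec = c("1" = "100482878", "12" = "100481925", "13" = "100468737")

# Character index: looked up by name, which is the intended behaviour.
by_name = toy_vec["12"]

# Integer index: looked up by position, with no warning.
by_position_2 = toy_vec[2]    # second element: the orthologue of human gene 12, not gene 2
by_position_12 = toy_vec[12]  # there is no 12th element, so the result is NA

by_name
by_position_2
by_position_12
```

With integer IDs the lookup still returns something, so the mistake is easy to miss. Converting both columns to character first makes every later lookup a name lookup.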
We have the mappings from human to panda, so now we can construct the hallmark gene sets for panda. First let’s obtain the hallmark gene sets for human. The get_msigdb() function is from the GSEAtopics package.
library(GSEAtopics)
gs_human = get_msigdb(version = "2024.1.Hs", collection = "h.all")
Next we perform the conversion by using the human genes as character indices into map_vec. Note that if a human gene is not mapped to a panda gene, the corresponding value will be NA, which we need to remove. Using the unname() function is optional: x2 is constructed from map_vec, which has names attached, and here I remove the names of x2 to make gs_panda easier to read.
gs_panda = lapply(gs_human, function(x) {
    x2 = map_vec[x]
    unname(unique(x2[!is.na(x2)])) # unique() guards against duplicated panda genes
})
Finally, we remove empty gene sets. Now gs_panda contains the hallmark gene sets for panda.
gs_panda = gs_panda[sapply(gs_panda, length) > 0]
gs_panda[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "100483394" "100467221" "100476344" "100467341" "100480578" "100483719"
## [7] "100467494" "100479955" "100471606" "100476922" "109488363" "100481319"
## [13] "100473600" "100476840" "100471854" "100475159" "100483231" "100478103"
## [19] "100464955" "100475013" "100470064" "100477661" "100475955" "100478341"
## [25] "100467218" "100484596" "100478407" "100473683" "100484116" "100465691"
## [31] "100480054" "100472549" "100473949" "100470190" "100472500" "100481463"
## [37] "100483649" "100463982" "100470950" "100479936" "100464675" "100483793"
## [43] "100482840" "100471225" "100475944" "100478752" "100467925" "100478122"
## [49] "100471241" "100479884" "100475526" "100475627" "100465139" "100464724"
## [55] "100463800" "100480854" "100473628" "100479670" "100476910" "100481266"
## [61] "100479095" "100467495" "100471989" "100477433" "100479347" "100472325"
## [67] "100474044" "100478586" "100473571" "100471547" "100467523" "100477228"
## [73] "100465578" "100464202" "100483572" "100473687" "100469740" "100475353"
## [79] "100470081" "100468497" "100470379" "100468331" "100477654" "100479358"
## [85] "100471171" "100468728" "100483093" "100482626" "100471953" "100473120"
## [91] "100484549" "100466232" "100480395" "100463910" "100480163" "100482520"
## [97] "100483295" "100473768" "100470553" "100475280" "100468697" "100478658"
## [103] "100470630" "100472419" "100474605" "100463863" "100479605" "100472770"
## [109] "100482804" "105238479" "100484030" "100464608" "100464508" "100465272"
## [115] "100483500" "100475941" "100472911" "100477762" "100474896" "100478484"
## [121] "100481263" "100472251" "100469925" "100477501" "100477724" "100480436"
## [127] "100465245" "105241326" "100469480" "100471478" "100479273" "100464074"
## [133] "100479861" "100475139" "100483714" "100467763" "100478217" "100474540"
## [139] "100465011" "100482938" "100467132" "100465583" "100478555" "100466891"
## [145] "100470302" "100480537" "100482490" "100465465" "100471603" "100466261"
## [151] "105238785" "100476977" "100479709" "100483477" "100472435" "100479054"
## [157] "100476123" "100473802" "100484067" "100469567" "100473290" "100480530"
## [163] "100464327" "100471124" "100467163" "100471492" "100475614" "100466386"
## [169] "100465252" "100473253" "100477509" "100470895" "100465294" "100476369"
## [175] "100466768" "100479654" "100464478" "100480186" "100484661" "100484543"
## [181] "100476136" "100484798" "100468480" "100472838" "100477702" "100473348"
## [187] "100484841" "100479434" "100478686" "100483229" "100465735" "100464004"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "100474675" "100477383" "100465106" "100482126" "100472026" "100465793"
## [7] "100475168" "100470489" "100465389" "100473529" "100478429" "100480891"
## [13] "100474163" "100477480" "100466311" "100481549" "100470232" "100463753"
## [19] "100471517" "100479467" "100484065" "100478428" "100484298" "100477902"
## [25] "100484718" "100482462" "100484468" "100482689" "100476345" "100467917"
## [31] "100480521" "100475502" "100475218" "100471626" "100480227" "100483382"
## [37] "100483271" "100484695" "100472499" "100481747" "100467898" "100467411"
## [43] "100472520" "100483259" "105234669" "100469185" "100475682" "100483900"
## [49] "100478672" "100484183" "105235090" "100478049" "100467045" "100467465"
## [55] "100465360" "100473811" "100472504" "100464489" "100469300" "100472871"
## [61] "105240499" "100473767" "100468235" "100467541" "100467791" "100464552"
## [67] "100466565" "100481771" "100475920" "100481622" "100476014" "100479773"
## [73] "100468902" "100474777" "100464636" "100477284" "100464393" "100478290"
## [79] "105241843" "100476475" "100480624" "100482679" "100462659" "100473768"
## [85] "100481672" "100463682" "100471679" "100481671" "100478943" "100470471"
## [91] "100477772" "100474774" "100474520" "100484790" "100467740" "100484109"
## [97] "100476941" "100480874" "100480862" "100475725" "100466247" "100478292"
## [103] "100468126" "100482473" "100476095" "100481194" "105237562" "100463717"
## [109] "100484001" "100481889" "100474086" "100469809" "100464009" "100484264"
## [115] "100475688" "100481360" "105234734" "100471453" "100477981" "100463836"
## [121] "100483416" "100467628" "100479649" "100479606" "100466506" "100470792"
## [127] "100482608" "100475429" "100482185" "100476115" "100468228" "100471786"
## [133] "100476162" "100468792" "100470060" "100464343" "100473675" "100478007"
## [139] "100484719" "100468826" "100473539" "100474611" "100479579" "100480506"
## [145] "100478926" "100479482" "100466931" "100471128" "100484671" "100464779"
## [151] "100476391" "100480183" "100469980" "100476314" "100476695" "100481666"
## [157] "100484430" "100474026" "100471596" "100464886" "100478040" "100484836"
## [163] "100470703" "100463584" "100465131" "100476205" "100472707" "100465129"
## [169] "100468855" "100472456" "100471563" "100473222" "100474087" "100467951"
## [175] "100479205" "100476065" "100472174" "100479480" "100473238"
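The conversion and filtering steps can be traced on a toy example that runs without any package. The mapping values are the six shown by head(map_vec); the set names SET_A, SET_B and SET_C, the ID "99999" and all object names are made up for illustration.

```r
# The six mappings printed by head(map_vec); NA means no panda orthologue.
toy_vec = c("1" = "100482878", "2" = NA, "9" = NA, "10" = NA,
            "12" = "100481925", "13" = "100468737")

# Hypothetical gene sets of human Entrez IDs (as characters).
toy_sets = list(
    SET_A = c("1", "2", "12"),   # two mapped genes, one unmapped
    SET_B = c("2", "9"),         # nothing mapped, so the set becomes empty
    SET_C = c("13", "99999")     # "99999" is absent from the names and also yields NA
)

# Same two steps as for gs_panda: convert and drop NA, then drop empty sets.
toy_panda = lapply(toy_sets, function(x) {
    x2 = toy_vec[x]
    unname(unique(x2[!is.na(x2)]))
})
toy_panda = toy_panda[sapply(toy_panda, length) > 0]
toy_panda
```

SET_B disappears because none of its genes has an orthologue. An ID that is missing from the names of the mapping vector behaves the same way as an ID mapped to NA.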
GSEAtopics already provides a helper function for this, gs_map_to_orthologues(). The gene IDs in the source gene sets must be Entrez IDs.
gs_rat = gs_map_to_orthologues(gs_human, from = "Homo.sapiens",
    to = "Rattus.norvegicus")
gs_rat[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "313210" "362302" "170465" "25287" "24158" "64304"
## [7] "24159" "79250" "50681" "25289" "100361007" "246253"
## [13] "312670" "294324" "83533" "24184" "29539" "24189"
## [19] "89807" "362850" "64312" "25728" "64363" "29308"
## [25] "298757" "25390" "192241" "304601" "25244" "312682"
## [31] "303836" "24232" "24248" "287710" "316384" "29157"
## [37] "64315" "295629" "54238" "361824" "309361" "291541"
## [43] "310201" "298410" "298069" "290905" "29309" "304542"
## [49] "498909" "25282" "303393" "171335" "25413" "311849"
## [55] "170587" "300047" "24307" "29611" "29318" "117543"
## [61] "84497" "64191" "299135" "287380" "81654" "298942"
## [67] "24908" "290370" "362011" "64526" "140547" "297342"
## [73] "171402" "84050" "65030" "293701" "29579" "292845"
## [79] "79451" "29383" "64558" "305240" "25112" "288333"
## [85] "290596" "29653" "290843" "25062" "64845" "64317"
## [91] "29328" "79563" "113965" "301384" "113906" "24479"
## [97] "114096" "25179" "116465" "312444" "81008" "684327"
## [103] "29491" "502872" "309816" "25608" "81680" "25330"
## [109] "362434" "24539" "114097" "170920" "294972" "81829"
## [115] "24552" "29254" "289197" "296623" "288271" "297799"
## [121] "171451" "295922" "288057" "363227" "25488" "293453"
## [127] "361385" "295923" "305751" "259274" "83717" "24614"
## [133] "64031" "25511" "64460" "117276" "25741" "24645"
## [139] "171434" "114209" "64534" "298199" "29441" "25664"
## [145] "24667" "304799" "64371" "58842" "500199" "24929"
## [151] "64192" "360571" "364838" "362835" "246250" "246298"
## [157] "361293" "311328" "100364162" "306873" "140945" "300111"
## [163] "25073" "25541" "298596" "289217" "29723" "292657"
## [169] "29743" "170943" "94172" "170551" "298906" "64347"
## [175] "24786" "686098" "503306" "25434" "691966" "500364"
## [181] "296655" "114597" "83783" "83688" "252961" "64524"
## [187] "170842" "25274" "50522" "114590" "311864" "54315"
## [193] "685322" "690848" "301011" "497902" "89811" "56010"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "292023" "361390" "79249" "83817" "29263" "24185"
## [7] "29722" "24223" "29592" "83477" "680611" "497672"
## [13] "24231" "297339" "29131" "362506" "117551" "64033"
## [19] "25193" "25109" "497761" "25300" "25660" "315609"
## [25] "300678" "24932" "171369" "84349" "303747" "25599"
## [31] "100913063" "25408" "56822" "24930" "24931" "498079"
## [37] "299314" "300649" "78965" "315707" "50654" "498335"
## [43] "246759" "84475" "116483" "304775" "24329" "292148"
## [49] "362952" "691947" "298573" "287444" "302811" "59325"
## [55] "24356" "29251" "25439" "246097" "25385" "289211"
## [61] "79113" "293860" "499537" "79214" "64043" "289437"
## [67] "299242" "266708" "288077" "687001" "29560" "294274"
## [73] "294273" "24984" "365542" "294269" "499415" "686326"
## [79] "25712" "116465" "360697" "171064" "84351" "25325"
## [85] "171040" "84405" "64546" "171333" "116553" "25670"
## [91] "116996" "29197" "373540" "24494" "116562" "288905"
## [97] "25704" "25746" "140924" "287287" "25084" "24498"
## [103] "25647" "116558" "29200" "25196" "291100" "293624"
## [109] "292060" "308995" "309684" "363577" "24514" "300250"
## [115] "313050" "155918" "60584" "361795" "499800" "291359"
## [121] "81515" "313121" "292763" "64668" "81687" "300974"
## [127] "305606" "500904" "300955" "117547" "287362" "24599"
## [133] "25498" "360918" "50669" "25023" "24681" "291983"
## [139] "116689" "24699" "287191" "362491" "25347" "287122"
## [145] "29257" "29287" "29288" "103689992" "500449" "252971"
## [151] "500616" "366126" "56782" "116696" "100363145" "25124"
## [157] "367264" "24811" "24812" "25217" "59086" "81809"
## [163] "24832" "116510" "305354" "310553" "364594" "305353"
## [169] "24835" "294900" "311786" "498075" "361831" "116725"
## [175] "314442" "317371" "301348"
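To see the manual steps of this section in one place, they can be wrapped into a small function. The sketch below is not the implementation of gs_map_to_orthologues(); the function name map_sets_by_table and its arguments are hypothetical. It assumes a two-column data frame like the one returned by select() and uses split() instead of a named vector, so that if a source gene ever appears in more than one row, all of its partners are kept.

```r
# Hypothetical helper, not part of GSEAtopics or Orthology.eg.db.
# gs:     named list of character vectors of source gene IDs
# map_df: two-column data frame (source ID, target ID)
map_sets_by_table = function(gs, map_df) {
    from = as.character(map_df[, 1])
    to = as.character(map_df[, 2])
    keep = !is.na(from) & !is.na(to)
    # One list element per source gene, holding all of its target genes.
    lookup = split(to[keep], from[keep])
    out = lapply(gs, function(x) {
        unique(unlist(lookup[intersect(x, names(lookup))], use.names = FALSE))
    })
    # Drop gene sets that ended up empty.
    out[sapply(out, length) > 0]
}

# Usage on four rows copied from the head(map_df) output; set names are made up.
toy_map = data.frame(
    Homo.sapiens           = c(1L, 2L, 12L, 13L),
    Ailuropoda.melanoleuca = c(100482878L, NA, 100481925L, 100468737L)
)
toy_result = map_sets_by_table(list(SET_A = c("1", "2", "12"), SET_B = "2"), toy_map)
toy_result
```

The function converts both columns to character itself, so it is safe to pass the integer columns that select() returns.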
Practice
Practice 1
Construct hallmark gene sets for the dolphin (Latin name: Tursiops truncatus).
First check whether it is supported in Orthology.eg.db.
kt[grep("Tursiops", kt, ignore.case = TRUE)]
## [1] "Tursiops.truncatus"
map_df = select(Orthology.eg.db, keys,
    columns = "Tursiops.truncatus", keytype = "Homo.sapiens")
map_df[, 1] = as.character(map_df[, 1])
map_df[, 2] = as.character(map_df[, 2])
map_vec = structure(map_df[, 2], names = map_df[, 1])
gs_dolphin = lapply(gs_human, function(x) {
    x2 = map_vec[x]
    unname(x2[!is.na(x2)])
})
gs_dolphin = gs_dolphin[sapply(gs_dolphin, length) > 0]
gs_dolphin[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "101324261" "101331892" "101333071" "101321535" "101339921" "117307625"
## [7] "101333508" "101338166" "101338715" "101337662" "101337434" "101324186"
## [13] "101339119" "101315601" "101333510" "101324917" "101318798" "117308060"
## [19] "101331188" "117311769" "101336246" "117308978" "101334408" "117313583"
## [25] "101337204" "101322713" "101331966" "101320909" "117309210" "101319712"
## [31] "101317605" "101338084" "101331217" "101333092" "101318581" "101336723"
## [37] "101331721" "101336369" "101321426" "101320458" "101327681" "101328124"
## [43] "101329295" "101322991" "101323341" "101320029" "101333781" "101325542"
## [49] "101332916" "101318829" "101338610" "101335861" "101331598" "101326966"
## [55] "117308749" "101325838" "101321389" "101332530" "117308690" "101332074"
## [61] "101323928" "101330785" "101334301" "101321942" "101335654" "101338928"
## [67] "101335930" "101334607" "101321860" "101334218" "101327108" "101317181"
## [73] "101331943" "101330375" "101321280" "101318459" "117311150" "101319836"
## [79] "101338608" "101324500" "101340172" "101323442" "101337802" "101339046"
## [85] "101330605" "101335592" "101338890" "117311850" "101338602" "101326990"
## [91] "101316447" "101334850" "101328969" "101321334" "101324223" "101327594"
## [97] "101327173" "117314225" "101332137" "101332387" "101326295" "101331166"
## [103] "117313539" "101335303" "117309227" "101324084" "101337343" "101329342"
## [109] "101318463" "101337284" "101336162" "101326342" "101337956" "101336374"
## [115] "101332376" "101329906" "101321797" "101329039" "101322039" "101319682"
## [121] "101321364" "101329593" "101330807" "117311742" "101319163" "101318892"
## [127] "101328287" "101318749" "101322527" "101319551" "117309367" "101332305"
## [133] "101325875" "101339425" "101337346" "101331169" "101324967" "117314139"
## [139] "101327440" "101332835" "101326018" "101322234" "101334941" "101316073"
## [145] "101326873" "101329793" "101335522" "101332488" "101333678" "117311841"
## [151] "117311779" "101333929" "101337766" "101333557" "101335961" "101316913"
## [157] "101337558" "101323493" "101327629" "101332633" "101315682" "101329811"
## [163] "101336956" "101329389" "101333152" "101327260" "101331047" "101336065"
## [169] "101322982" "101337059" "101337718" "101338733" "101331474" "101333013"
## [175] "101320464" "101332754" "101316418" "101330490" "101327745" "101337563"
## [181] "101318801" "101331812" "101328440" "101326471" "101316129" "101333717"
## [187] "101322601" "101318867" "117311820" "101328687" "101316570" "109549330"
## [193] "117308030"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "101316006" "101322461" "101325827" "117308055" "101337652" "101334655"
## [7] "101320863" "117311091" "101338911" "101335912" "101332324" "101319443"
## [13] "117313830" "117307933" "101332498" "117312582" "101323189" "101337792"
## [19] "109552529" "101332929" "101318861" "101337386" "101324372" "101318685"
## [25] "101320520" "101318977" "101319269" "101319080" "101330521" "101337561"
## [31] "101338006" "101334285" "101322889" "117308993" "101316111" "101328068"
## [37] "101325833" "109551019" "101319896" "117312596" "117310434" "101328686"
## [43] "101325006" "101318184" "101338838" "101333230" "101331685" "101329574"
## [49] "101317829" "101338156" "101330476" "101325906" "101324454" "101325246"
## [55] "101321725" "101337526" "101317832" "101324567" "101316077" "101317421"
## [61] "101315588" "101321590" "101336517" "101336680" "101316660" "101336991"
## [67] "101315855" "101317113" "101337909" "101338980" "101324613" "101339595"
## [73] "101324632" "101335518" "101335803" "117313882" "101336672" "101336089"
## [79] "117311758" "101315573" "101330507" "101337436" "101327594" "101336209"
## [85] "101330540" "101327715" "101330184" "117309240" "101325702" "101338510"
## [91] "101339814" "101338603" "101319832" "101333811" "101334331" "101331381"
## [97] "101318809" "101325136" "101324556" "101323898" "101330858" "117310158"
## [103] "101338293" "101316092" "101329901" "101322100" "101331699" "101316054"
## [109] "101318917" "101335564" "117313261" "101337877" "101321073" "101324233"
## [115] "101323565" "101334791" "117314288" "101333191" "101317602" "101318472"
## [121] "117313815" "109551474" "101339042" "101319418" "101320162" "101321493"
## [127] "101329372" "101330701" "101316407" "101340051" "101338757" "101327068"
## [133] "101318929" "101338302" "101338542" "101335049" "101331937" "117308498"
## [139] "101318466" "101328304" "117309120" "101323322" "101334480" "101322158"
## [145] "101332255" "101329453" "117308176" "101336040" "117308963" "101323062"
## [151] "101339204" "101329800" "101328107" "101331087" "101321759" "101329195"
## [157] "101326362" "101326807" "101322203" "101328796" "101335034" "101325082"
## [163] "101327220" "101327029" "101315499" "117313367" "101334886" "101319774"
## [169] "101338049" "101339193" "101320069" "101318178" "101334069" "101319749"
## [175] "101334010" "101319013" "101326041" "101329104" "101326409" "101320092"
Or use the helper function:
gs_dolphin = gs_map_to_orthologues(gs_human, from = "Homo.sapiens",
    to = "Tursiops.truncatus")