Topic 1-08: Gene ID conversion
Zuguang Gu z.gu@dkfz.de
2025-05-31
Source: vignettes/topic1_08_gene_id.Rmd
Gene ID conversion is a very common task in gene set enrichment analysis. We take org.Hs.eg.db (for human) as an example.
library(org.Hs.eg.db)
Note: it works the same way for the OrgDb objects of other organisms on AnnotationHub.
Use the select() interface
We need the following three types of information:
- keys: Gene IDs in one ID type;
- keytype: The name of the input ID type;
- columns: The name(s) of the output ID type(s).
To get the valid names of ID types:
keytypes(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GENETYPE" "GO" "GOALL" "IPI" "MAP"
## [16] "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM"
## [21] "PMID" "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
## [26] "UNIPROT"
1:1 mapping
For example, we want to convert the following two genes into Entrez IDs.
genes = c("TP53", "MDM2")
- Use select(). The following function call can be read as “select ‘ENTREZID’ for the genes whose ‘SYMBOL’ is in ‘genes’”.
map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENTREZID")
map
## SYMBOL ENTREZID
## 1 TP53 7157
## 2 MDM2 4193
Then we can create a named vector for the conversion:
map_vec = structure(map$ENTREZID, names = map$SYMBOL)
map_vec[genes]
## TP53 MDM2
## "7157" "4193"
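As a small base-R sketch of the same idea, the lookup vector can also be built with setNames(). The data frame below is typed in by hand from the output above rather than queried from org.Hs.eg.db, and "NOTAGENE" is a made-up symbol used only to show that a key missing from the table comes back as NA rather than raising an error.

```r
# Toy copy of the select() result shown above (typed in by hand)
map = data.frame(
    SYMBOL = c("TP53", "MDM2"),
    ENTREZID = c("7157", "4193"),
    stringsAsFactors = FALSE
)

# setNames() is equivalent to structure(..., names = ...)
map_vec = setNames(map$ENTREZID, map$SYMBOL)

# "NOTAGENE" is a hypothetical symbol that is not in the table
query = c("MDM2", "NOTAGENE", "TP53")
converted = unname(map_vec[query])
converted
```

Checking the result with anyNA() before passing it on is a cheap way to notice symbols that could not be converted.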
What if we convert them to Ensembl gene IDs:
map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENSEMBL")
map
## SYMBOL ENSEMBL
## 1 TP53 ENSG00000141510
## 2 MDM2 ENSG00000135679
If you want to map to multiple ID types:
select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL"))
## SYMBOL ENTREZID ENSEMBL
## 1 TP53 7157 ENSG00000141510
## 2 MDM2 4193 ENSG00000135679
- Use mapIds(). Note the argument is named column instead of columns, so you can only map to one gene ID type.
mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", column = "ENTREZID")
## TP53 MDM2
## "7157" "4193"
mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", column = "ENSEMBL")
## TP53 MDM2
## "ENSG00000141510" "ENSG00000135679"
1:many mapping
Now there might be some problems if the mapping is not 1:1.
genes = c("TP53", "MMD2")
map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENTREZID")
map
## SYMBOL ENTREZID
## 1 TP53 7157
## 2 MMD2 221938
## 3 MMD2 100505381
Usually it is hard to pick one unique gene in such a 1:many mapping case, but we can add an additional column “GENETYPE” when querying:
map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = c("ENTREZID", "GENETYPE"))
map
## SYMBOL ENTREZID GENETYPE
## 1 TP53 7157 protein-coding
## 2 MMD2 221938 protein-coding
## 3 MMD2 100505381 unknown
For “MMD2”, adding the “GENETYPE” column works because its second hit is annotated as “unknown”, so we can simply remove it.
map[map$GENETYPE == "protein-coding", ]
## SYMBOL ENTREZID GENETYPE
## 1 TP53 7157 protein-coding
## 2 MMD2 221938 protein-coding
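Filtering by gene type does not guarantee that every symbol ends up with exactly one Entrez ID, so it may be worth checking afterwards. The following is a minimal base-R sketch, assuming a data frame typed in by hand from the output above; the names map_pc and ambiguous are introduced here only for illustration.

```r
# Toy copy of the select() result shown above (typed in by hand)
map = data.frame(
    SYMBOL = c("TP53", "MMD2", "MMD2"),
    ENTREZID = c("7157", "221938", "100505381"),
    GENETYPE = c("protein-coding", "protein-coding", "unknown"),
    stringsAsFactors = FALSE
)

# which() drops rows where GENETYPE is NA instead of returning all-NA rows
map_pc = map[which(map$GENETYPE == "protein-coding"), ]

# Symbols that still map to more than one Entrez ID after filtering
ambiguous = unique(map_pc$SYMBOL[duplicated(map_pc$SYMBOL)])
ambiguous
```

If ambiguous is not empty, those symbols still need a manual decision before the table is turned into a named vector.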
And it is always a good idea to only include protein-coding genes in gene set enrichment analysis.
When you run mapIds(), if there are multiple mappings, by default it only selects the first one, which may cause problems if the first mapping is not protein-coding.
mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", column = "ENTREZID")
## TP53 MMD2
## "7157" "221938"
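To see which symbols are affected before the first hit is picked silently, mapIds() also accepts a multiVals argument. The sketch below uses multiVals = "list" to keep all hits per key. It assumes org.Hs.eg.db is installed, and the exact IDs returned depend on the annotation version.

```r
library(org.Hs.eg.db)

genes = c("TP53", "MMD2")

# multiVals = "list" returns every Entrez ID for each symbol
# instead of keeping only the first one
all_hits = mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL",
    column = "ENTREZID", multiVals = "list")

# Symbols with more than one hit need inspection or filtering
multi = names(all_hits)[lengths(all_hits) > 1]
multi
```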
Use the pre-generated objects
In org.*.db, there are also pre-generated objects that already contain mappings between Entrez IDs and a specific gene ID type.
ls(envir = asNamespace("org.Hs.eg.db"))
## [1] "datacache" "org.Hs.eg"
## [3] "org.Hs.eg.db" "org.Hs.egACCNUM"
## [5] "org.Hs.egACCNUM2EG" "org.Hs.egALIAS2EG"
## [7] "org.Hs.egCHR" "org.Hs.egCHRLENGTHS"
## [9] "org.Hs.egCHRLOC" "org.Hs.egCHRLOCEND"
## [11] "org.Hs.egENSEMBL" "org.Hs.egENSEMBL2EG"
## [13] "org.Hs.egENSEMBLPROT" "org.Hs.egENSEMBLPROT2EG"
## [15] "org.Hs.egENSEMBLTRANS" "org.Hs.egENSEMBLTRANS2EG"
## [17] "org.Hs.egENZYME" "org.Hs.egENZYME2EG"
## [19] "org.Hs.egGENENAME" "org.Hs.egGENETYPE"
## [21] "org.Hs.egGO" "org.Hs.egGO2ALLEGS"
## [23] "org.Hs.egGO2EG" "org.Hs.egMAP"
## [25] "org.Hs.egMAP2EG" "org.Hs.egMAPCOUNTS"
## [27] "org.Hs.egOMIM" "org.Hs.egOMIM2EG"
## [29] "org.Hs.egORGANISM" "org.Hs.egPATH"
## [31] "org.Hs.egPATH2EG" "org.Hs.egPFAM"
## [33] "org.Hs.egPMID" "org.Hs.egPMID2EG"
## [35] "org.Hs.egPROSITE" "org.Hs.egREFSEQ"
## [37] "org.Hs.egREFSEQ2EG" "org.Hs.egSYMBOL"
## [39] "org.Hs.egSYMBOL2EG" "org.Hs.egUCSCKG"
## [41] "org.Hs.egUNIPROT" "org.Hs.eg_dbInfo"
## [43] "org.Hs.eg_dbconn" "org.Hs.eg_dbfile"
## [45] "org.Hs.eg_dbschema"
The following six objects can be used to convert between major gene ID types:
- org.Hs.egENSEMBL: Entrez -> Ensembl
- org.Hs.egENSEMBL2EG: Ensembl -> Entrez
- org.Hs.egREFSEQ: Entrez -> RefSeq
- org.Hs.egREFSEQ2EG: RefSeq -> Entrez
- org.Hs.egSYMBOL: Entrez -> Symbol
- org.Hs.egSYMBOL2EG: Symbol -> Entrez
org.Hs.egSYMBOL
## SYMBOL map for Human (object of class "AnnDbBimap")
If you have a single gene, you can use [[:
org.Hs.egSYMBOL2EG[["TP53"]]
## [1] "7157"
If you have multiple genes, use [ + as.list():
lt = as.list(org.Hs.egSYMBOL2EG)
lt[genes]
## $TP53
## [1] "7157"
##
## $MMD2
## [1] "221938" "100505381"
# or
as.list(org.Hs.egSYMBOL2EG[genes])
## $TP53
## [1] "7157"
##
## $MMD2
## [1] "221938" "100505381"
But you have to face the same problem of multiple mappings.
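One way to handle this is to separate unambiguous symbols from ambiguous ones before building a lookup vector. The following base-R sketch works on a list typed in by hand from the output above, so it runs without org.Hs.eg.db; n_hits, unique_map and ambiguous are illustrative names.

```r
# Toy copy of as.list(org.Hs.egSYMBOL2EG[genes]) shown above (typed in by hand)
lt = list(
    TP53 = "7157",
    MMD2 = c("221938", "100505381")
)

# Number of Entrez IDs per symbol
n_hits = lengths(lt)

# Keep unambiguous symbols as a plain named vector
unique_map = unlist(lt[n_hits == 1])

# Symbols that need a decision, e.g. filtering by gene type
ambiguous = names(lt)[n_hits > 1]
```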
The storage type of Entrez IDs
This may silently cause problems in your analysis. Let’s say we have the following six genes with their Entrez IDs and symbols.
eg = c(1, 3, 6, 8, 10, 37)
symbol = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6")
Entrez IDs are represented as integers. If the corresponding data is imported from text tables, the column of Entrez IDs will be automatically saved as integers.
Now we want to convert from Entrez IDs to gene symbols. We normally generate a “map” variable where Entrez IDs are the names:
map = symbol
names(map) = eg
map
## 1 3 6 8 10 37
## "gene1" "gene2" "gene3" "gene4" "gene5" "gene6"
Then we give map the Entrez IDs as indices and expect it to return the corresponding symbols. However, if Entrez IDs are saved as integers, there will be a problem:
unname(map[eg])
## [1] "gene1" "gene3" "gene6" NA NA NA
The problem is that a vector supports both character indices and integer indices. In the previous usage, we expect eg to be used as character indices (matched against the names), but in fact it is used as integer indices (positions in the vector). We have to explicitly convert eg to characters.
unname(map[as.character(eg)])
## [1] "gene1" "gene2" "gene3" "gene4" "gene5" "gene6"
On the other hand, Entrez IDs are used as integers only inside the NCBI database. On the user side, it is recommended to save them as characters, because they act more like a “name representation” of genes.
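The import step is where the integer type usually sneaks in. The sketch below uses a made-up three-gene table passed through the text argument of read.csv() in place of a file on disk. It shows that colClasses can force the ID column to be read as character, and compares indexing by position with indexing by name.

```r
# A made-up three-gene table, standing in for a text file on disk
txt = "entrez,symbol
1,gene1
3,gene2
6,gene3"

# Default import: the entrez column is guessed to be integer
df_int = read.csv(text = txt)
class(df_int$entrez)

# Force both columns to be read as character
df_chr = read.csv(text = txt, colClasses = c("character", "character"))
class(df_chr$entrez)

map = setNames(df_chr$symbol, df_chr$entrez)

# Integer IDs are used as positions: wrong or missing symbols
by_position = unname(map[df_int$entrez])
# Character IDs are matched against names: correct symbols
by_name = unname(map[df_chr$entrez])
```

Here by_position silently returns the wrong symbol for Entrez ID 3 and NA for ID 6, while by_name returns the three symbols in order.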
Conclusion
If the mapping is 1:1, all the methods mentioned are OK. When the mapping is 1:many, using select() and then filtering by the “GENETYPE” column is safer.
In the GSEAtopics package, there is a function map_to_entrez() which generates a global mapping variable to Entrez IDs for protein-coding genes.
library(GSEAtopics)
# symbol to Entrez ID
map = map_to_entrez("SYMBOL", org.Hs.eg.db)
head(map)
## A1BG A2M NAT1 NAT2 SERPINA3 AADAC
## "1" "2" "9" "10" "12" "13"
Practice
Practice 1
Using org.Hs.eg.db, compare the following three types of gene IDs: SYMBOL, ENTREZID and ENSEMBL. For two types denoted as $A$ and $B$, i.e. for the conversion $A \rightarrow B$, calculate the fraction of IDs in $A$ that have multiple mappings. Do the comparison for every combination of $A$ and $B$, and also do it restricted to protein-coding genes only.
t = c("SYMBOL", "ENTREZID", "ENSEMBL")
for(i in 1:3) {
    for(j in 1:3) {
        if(i == j) next
        cat("conversion ", t[i], " -> ", t[j], ": ", sep = "")
        # all IDs of the input type
        all_ids = keys(org.Hs.eg.db, keytype = t[i])
        suppressMessages(map <- select(org.Hs.eg.db, keys = all_ids, keytype = t[i], columns = t[j]))
        # number of rows (i.e. mappings) per input ID
        tb = table(map[[1]])
        # fraction of input IDs with more than one mapping
        p = sum(tb > 1)/length(tb)
        cat(p, "\n")
    }
}
## conversion SYMBOL -> ENTREZID: 8.799081e-05
## conversion SYMBOL -> ENSEMBL: 0.0130951
## conversion ENTREZID -> SYMBOL: 0
## conversion ENTREZID -> ENSEMBL: 0.01308812
## conversion ENSEMBL -> SYMBOL: 0.01731188
## conversion ENSEMBL -> ENTREZID: 0.01731188
And restricted to protein-coding genes only:
for(i in 1:3) {
    for(j in 1:3) {
        if(i == j) next
        cat("conversion ", t[i], " -> ", t[j], ": ", sep = "")
        all_ids = keys(org.Hs.eg.db, keytype = t[i])
        suppressMessages(map <- select(org.Hs.eg.db, keys = all_ids, keytype = t[i], columns = c(t[j], "GENETYPE")))
        # keep only the mappings of protein-coding genes
        map = map[map$GENETYPE == "protein-coding", ]
        tb = table(map[[1]])
        p = sum(tb > 1)/length(tb)
        cat(p, "\n")
    }
}
## conversion SYMBOL -> ENTREZID: 0
## conversion SYMBOL -> ENSEMBL: 0.08519417
## conversion ENTREZID -> SYMBOL: 0
## conversion ENTREZID -> ENSEMBL: 0.08519417
## conversion ENSEMBL -> SYMBOL: 0.01828566
## conversion ENSEMBL -> ENTREZID: 0.01828566
Two reasons for the multiple mappings between SYMBOL/ENTREZID and ENSEMBL:
- Gene alleles are recorded in Ensembl, but only the main/curated gene is listed on NCBI.
- New (unnamed) genes that are very similar to a known gene are all listed separately on NCBI, but they are all assigned to the same ID on Ensembl.