Topic 1-08: Gene ID conversion

Gene ID conversion is a very common task in gene set enrichment analysis. We take org.Hs.eg.db (for human) as an example.

library(org.Hs.eg.db)

Note: the same applies to the OrgDb objects for other organisms on AnnotationHub.

Use the select() interface

We need the following three types of information:

keys: Gene IDs in one ID type;
keytype: The name of the input ID type;
columns: The names of the output ID types.

To get the valid names of ID types:

keytypes(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

columns(org.Hs.eg.db)

##  [1] "ACCNUM"       "ALIAS"        "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS"
##  [6] "ENTREZID"     "ENZYME"       "EVIDENCE"     "EVIDENCEALL"  "GENENAME"    
## [11] "GENETYPE"     "GO"           "GOALL"        "IPI"          "MAP"         
## [16] "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"         "PFAM"        
## [21] "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"       "UCSCKG"      
## [26] "UNIPROT"

1:1 mapping

For example, we want to convert the following two genes into Entrez IDs.

genes = c("TP53", "MDM2")

Use select(). The following function call can be read as “select ‘ENTREZID’ for the genes whose ‘SYMBOL’ is in ‘genes’”.

map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENTREZID")
map

##   SYMBOL ENTREZID
## 1   TP53     7157
## 2   MDM2     4193

Then we can create a named vector for the conversion:

map_vec = structure(map$ENTREZID, names = map$SYMBOL)
map_vec[genes]

##   TP53   MDM2 
## "7157" "4193"
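As a minimal sketch of how such a lookup vector behaves, the following base R snippet rebuilds the map shown above by hand (so it runs without org.Hs.eg.db) and queries it with a hypothetical symbol, "NOT_A_GENE", that is absent from the map. Results come back in the order of the query, and unmatched symbols yield NA rather than an error.

```r
# Toy copy of the select() output above, typed in by hand
map <- data.frame(
    SYMBOL   = c("TP53", "MDM2"),
    ENTREZID = c("7157", "4193"),
    stringsAsFactors = FALSE
)

# Named vector: names are the input IDs, values are the output IDs
map_vec <- setNames(map$ENTREZID, map$SYMBOL)

# "NOT_A_GENE" is a made-up symbol used only for illustration
query <- c("MDM2", "TP53", "NOT_A_GENE")
converted <- unname(map_vec[query])
print(converted)

# Symbols that could not be converted
print(query[is.na(converted)])
```

Checking for NA after the lookup is a cheap way to notice genes that silently dropped out of the conversion.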

What if we convert them to Ensembl gene IDs instead?

map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENSEMBL")
map

##   SYMBOL         ENSEMBL
## 1   TP53 ENSG00000141510
## 2   MDM2 ENSG00000135679

If you want to map to multiple ID types:

select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = c("ENTREZID", "ENSEMBL"))

##   SYMBOL ENTREZID         ENSEMBL
## 1   TP53     7157 ENSG00000141510
## 2   MDM2     4193 ENSG00000135679

Use mapIds(), which is very similar to select().

Note the argument is named column instead of columns, so you can only map to one gene ID type.

mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", column = "ENTREZID")

##   TP53   MDM2 
## "7157" "4193"

mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", column = "ENSEMBL")

##              TP53              MDM2 
## "ENSG00000141510" "ENSG00000135679"

1:many mapping

Now there might be some problems if the mapping is not 1:1.

genes = c("TP53", "MMD2")
map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = "ENTREZID")
map

##   SYMBOL  ENTREZID
## 1   TP53      7157
## 2   MMD2    221938
## 3   MMD2 100505381

Usually it is hard to pick one unique gene in such a 1:many mapping case, but we can add an additional column “GENETYPE” when querying:

map = select(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", columns = c("ENTREZID", "GENETYPE"))
map

##   SYMBOL  ENTREZID       GENETYPE
## 1   TP53      7157 protein-coding
## 2   MMD2    221938 protein-coding
## 3   MMD2 100505381        unknown

For “MMD2”, adding the “GENETYPE” column works because its second hit is annotated as “unknown”. We can simply remove it.

map[map$GENETYPE == "protein-coding", ]

##   SYMBOL ENTREZID       GENETYPE
## 1   TP53     7157 protein-coding
## 2   MMD2   221938 protein-coding

And it is generally a good idea to include only protein-coding genes in gene set enrichment analysis.
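The filtering step can be sketched in a slightly more defensive form. The snippet below uses a hand-typed copy of the table above (so it runs without org.Hs.eg.db), reports which symbols have more than one hit, and then checks whether any symbol is still ambiguous after keeping protein-coding genes. The variable names (n_hits, ambiguous, still_ambiguous) are illustrative only.

```r
# Toy copy of the select() output with the GENETYPE column
map <- data.frame(
    SYMBOL   = c("TP53", "MMD2", "MMD2"),
    ENTREZID = c("7157", "221938", "100505381"),
    GENETYPE = c("protein-coding", "protein-coding", "unknown"),
    stringsAsFactors = FALSE
)

# Symbols with more than one hit before filtering
n_hits <- table(map$SYMBOL)
ambiguous <- names(n_hits)[n_hits > 1]
print(ambiguous)

# Keep protein-coding rows; the is.na() guard avoids NA rows
# if a gene has no GENETYPE annotation
keep <- !is.na(map$GENETYPE) & map$GENETYPE == "protein-coding"
pc <- map[keep, ]

# Symbols that are still ambiguous after filtering (none in this toy table)
still_ambiguous <- unique(pc$SYMBOL[duplicated(pc$SYMBOL)])
print(still_ambiguous)
```

If still_ambiguous is not empty on real data, the GENETYPE filter alone is not enough and those symbols need manual inspection.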

When you run mapIds() and there are multiple mappings, by default it selects only the first one, which may cause problems if the first mapping is not protein-coding.

mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL", column = "ENTREZID")

##     TP53     MMD2 
##   "7157" "221938"
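The default behaviour is controlled by the multiVals argument of mapIds(). As a hedged sketch, the snippet below asks for all hits as a list (multiVals = "list") and, alternatively, for NA whenever a key has more than one hit (multiVals = "asNA"). It is wrapped in a requireNamespace() check so that it only runs when org.Hs.eg.db is installed; the exact IDs returned depend on the version of the annotation package.

```r
# Runs only if the annotation package is installed
if (requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    library(org.Hs.eg.db)
    genes <- c("TP53", "MMD2")

    # Keep every hit: a named list with one element per key
    all_hits <- mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL",
                       column = "ENTREZID", multiVals = "list")

    # Keys whose mapping is ambiguous
    print(names(all_hits)[lengths(all_hits) > 1])

    # Alternatively, return NA for keys with more than one hit
    as_na <- mapIds(org.Hs.eg.db, keys = genes, keytype = "SYMBOL",
                    column = "ENTREZID", multiVals = "asNA")
    print(as_na)
}
```

Neither option picks the protein-coding gene for you; they only make the ambiguity visible instead of silently taking the first hit.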

Use the pre-generated objects

In org.*.db, there are also pre-generated objects that already contain mappings between Entrez IDs and a specific gene ID type.

ls(envir = asNamespace("org.Hs.eg.db"))

##  [1] "datacache"                "org.Hs.eg"               
##  [3] "org.Hs.eg.db"             "org.Hs.egACCNUM"         
##  [5] "org.Hs.egACCNUM2EG"       "org.Hs.egALIAS2EG"       
##  [7] "org.Hs.egCHR"             "org.Hs.egCHRLENGTHS"     
##  [9] "org.Hs.egCHRLOC"          "org.Hs.egCHRLOCEND"      
## [11] "org.Hs.egENSEMBL"         "org.Hs.egENSEMBL2EG"     
## [13] "org.Hs.egENSEMBLPROT"     "org.Hs.egENSEMBLPROT2EG" 
## [15] "org.Hs.egENSEMBLTRANS"    "org.Hs.egENSEMBLTRANS2EG"
## [17] "org.Hs.egENZYME"          "org.Hs.egENZYME2EG"      
## [19] "org.Hs.egGENENAME"        "org.Hs.egGENETYPE"       
## [21] "org.Hs.egGO"              "org.Hs.egGO2ALLEGS"      
## [23] "org.Hs.egGO2EG"           "org.Hs.egMAP"            
## [25] "org.Hs.egMAP2EG"          "org.Hs.egMAPCOUNTS"      
## [27] "org.Hs.egOMIM"            "org.Hs.egOMIM2EG"        
## [29] "org.Hs.egORGANISM"        "org.Hs.egPATH"           
## [31] "org.Hs.egPATH2EG"         "org.Hs.egPFAM"           
## [33] "org.Hs.egPMID"            "org.Hs.egPMID2EG"        
## [35] "org.Hs.egPROSITE"         "org.Hs.egREFSEQ"         
## [37] "org.Hs.egREFSEQ2EG"       "org.Hs.egSYMBOL"         
## [39] "org.Hs.egSYMBOL2EG"       "org.Hs.egUCSCKG"         
## [41] "org.Hs.egUNIPROT"         "org.Hs.eg_dbInfo"        
## [43] "org.Hs.eg_dbconn"         "org.Hs.eg_dbfile"        
## [45] "org.Hs.eg_dbschema"

The following six objects can be used to convert between major gene ID types:

org.Hs.egENSEMBL: Entrez -> Ensembl
org.Hs.egENSEMBL2EG: Ensembl -> Entrez
org.Hs.egREFSEQ: Entrez -> RefSeq
org.Hs.egREFSEQ2EG: RefSeq -> Entrez
org.Hs.egSYMBOL: Entrez -> Symbol
org.Hs.egSYMBOL2EG: Symbol -> Entrez

org.Hs.egSYMBOL

## SYMBOL map for Human (object of class "AnnDbBimap")

If you have a single gene, you can use [[:

org.Hs.egSYMBOL2EG[["TP53"]]

## [1] "7157"

If you have multiple genes, use [ together with as.list():

lt = as.list(org.Hs.egSYMBOL2EG)
lt[genes]

## $TP53
## [1] "7157"
## 
## $MMD2
## [1] "221938"    "100505381"

# or
as.list(org.Hs.egSYMBOL2EG[genes])

## $TP53
## [1] "7157"
## 
## $MMD2
## [1] "221938"    "100505381"

But you still face the same problem of multiple mappings.
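One way to at least detect these cases is to count the hits per key with lengths(). The sketch below works on a hand-typed copy of the list shown above, so it runs without org.Hs.eg.db; the names n_hits, multi and unique_map are illustrative.

```r
# Toy copy of the as.list() output above
lt <- list(TP53 = "7157", MMD2 = c("221938", "100505381"))

# Number of Entrez IDs per symbol
n_hits <- lengths(lt)

# Symbols with multiple mappings
multi <- names(n_hits)[n_hits > 1]
print(multi)

# Keep only the unambiguous symbols as a plain named vector
unique_map <- unlist(lt[n_hits == 1])
print(unique_map)
```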

The storage type of Entrez IDs

Entrez IDs are represented as integers. If the corresponding data is imported from text tables, the column of Entrez IDs will automatically be read as integers. This may silently cause problems in your analysis.

Let’s say we have the following six genes with their Entrez IDs and symbols.

eg = c(1, 3, 6, 8, 10, 37)
symbol = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6")

Now we want to convert from Entrez IDs to gene symbols. We normally generate a “map” variable where Entrez IDs are the names:

map = symbol
names(map) = eg
map

##       1       3       6       8      10      37 
## "gene1" "gene2" "gene3" "gene4" "gene5" "gene6"

We then index map with the Entrez IDs and expect it to return the corresponding symbols. However, if the Entrez IDs are saved as integers, there will be a problem:

unname(map[eg])

## [1] "gene1" "gene3" "gene6" NA      NA      NA

The problem is that a vector supports both character indices and integer indices. In the previous usage, we expected eg to be used as character indices, but in fact it was used as integer (positional) indices. We have to explicitly convert eg to characters.

unname(map[as.character(eg)])

## [1] "gene1" "gene2" "gene3" "gene4" "gene5" "gene6"

On the other hand, Entrez IDs as integers are only used inside the NCBI database, not on the user side. It is recommended to save them as characters, because they are more like a “name representation” of genes.
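When the IDs come from a text table, one way to follow this recommendation is to fix the column type at import time. The sketch below uses base R's read.delim() with the colClasses argument; the inline table and its column names (entrez, symbol) are made up for illustration.

```r
# A small tab-separated table, given inline instead of as a file
txt <- paste(
    "entrez\tsymbol",
    "1\tgene1", "3\tgene2", "6\tgene3",
    "8\tgene4", "10\tgene5", "37\tgene6",
    sep = "\n"
)

# Default import: the entrez column is guessed to be integer
tab_default <- read.delim(text = txt, stringsAsFactors = FALSE)
print(class(tab_default$entrez))

# Force the entrez column to character when reading
tab_chr <- read.delim(text = txt, colClasses = c(entrez = "character"),
                      stringsAsFactors = FALSE)
print(class(tab_chr$entrez))

# Indexing by name now works without an explicit as.character()
map <- setNames(tab_chr$symbol, tab_chr$entrez)
print(unname(map[tab_chr$entrez]))
```

Fixing the type once at import avoids having to remember as.character() at every later lookup.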

Conclusion

If the mapping is 1:1, all the methods mentioned are OK. When the mapping is 1:many, using select() and filtering by the “GENETYPE” column is safer.

In the GSEAtopics package, there is a function map_to_entrez() which generates a global mapping variable to Entrez IDs for protein-coding genes.

library(GSEAtopics)
# symbol to Entrez ID
map = map_to_entrez("SYMBOL", org.Hs.eg.db)
head(map)

##     A1BG      A2M     NAT1     NAT2 SERPINA3    AADAC 
##      "1"      "2"      "9"     "10"     "12"     "13"

Practice

Practice 1

Using org.Hs.eg.db, compare the following three types of gene IDs: SYMBOL, ENTREZID and ENSEMBL. For two types denoted as $T_1$ and $T_2$, i.e. for the conversion $T_1 \rightarrow T_2$, calculate the fraction of IDs in $T_1$ that have multiple mappings. Do the comparison for every combination of $T_1$ and $T_2$, and also repeat it restricted to protein-coding genes.

t = c("SYMBOL", "ENTREZID", "ENSEMBL")
for(i in 1:3) {
    for(j in 1:3) {
        if(i == j) next
        cat("conversion ", t[i], " -> ", t[j], ": ", sep = "")

        all_ids = keys(org.Hs.eg.db, keytype = t[i])
        suppressMessages(map <- select(org.Hs.eg.db, keys = all_ids, keytype = t[i], columns = t[j]))
        tb = table(map[[1]])
        p = sum(tb > 1)/length(tb)
        cat(p, "\n")
    }
}

## conversion SYMBOL -> ENTREZID: 8.799081e-05 
## conversion SYMBOL -> ENSEMBL: 0.0130951 
## conversion ENTREZID -> SYMBOL: 0 
## conversion ENTREZID -> ENSEMBL: 0.01308812 
## conversion ENSEMBL -> SYMBOL: 0.01731188 
## conversion ENSEMBL -> ENTREZID: 0.01731188

Restricted to protein-coding genes only:

for(i in 1:3) {
    for(j in 1:3) {
        if(i == j) next
        cat("conversion ", t[i], " -> ", t[j], ": ", sep = "")

        all_ids = keys(org.Hs.eg.db, keytype = t[i])
        suppressMessages(map <- select(org.Hs.eg.db, keys = all_ids, keytype = t[i], columns = c(t[j], "GENETYPE")))
        map = map[map$GENETYPE == "protein-coding", ]
        tb = table(map[[1]])
        p = sum(tb > 1)/length(tb)
        cat(p, "\n")
    }
}

## conversion SYMBOL -> ENTREZID: 0 
## conversion SYMBOL -> ENSEMBL: 0.08519417 
## conversion ENTREZID -> SYMBOL: 0 
## conversion ENTREZID -> ENSEMBL: 0.08519417 
## conversion ENSEMBL -> SYMBOL: 0.01828566 
## conversion ENSEMBL -> ENTREZID: 0.01828566

Two reasons for the multiple mappings between SYMBOL/ENTREZID and ENSEMBL:

Gene alleles are recorded in Ensembl, but only the main/curated gene is listed on NCBI.
New (unnamed) genes which are very similar to gene $G$ are all listed on NCBI, but they are all assigned to the same ID as $G$ on Ensembl.

Zuguang Gu z.gu@dkfz.de

2025-05-31