Topic 1-02: Get pathways from KEGG

KEGG is a comprehensive database which not only contains pathways but also a large variaty of other information. The entrence of getting KEGG data programmatically is via the KEGG REST API.

Please note the restrictions of using KEGG data (https://www.kegg.jp/kegg/legal.html):

Academic use of KEGG

Academic users may freely use the KEGG website at https://www.kegg.jp/ or its mirror site at GenomeNet https://www.genome.jp/kegg/. Academic users who utilize KEGG for providing academic services are requested to obtain an academic service provider license, which is included in the KEGG FTP academic subscription. The KEGG FTP academic subscription, which is a paid service (see background information), may also be obtained to conveniently download the entire KEGG database.

Non-academic use of KEGG

Non-academic users must understand that KEGG is not a public database and non-academic use of KEGG generally requires a commercial license. There are two types of commercial licenses available: end user and business. The end user license includes access rights to the FTP site and the website, while the business license includes access rights to the FTP site only. Please contact Pathway Solutions for more details.

Directly read from the KEGG API

KEGG provides its data via a REST API (https://rest.kegg.jp/). There are several commands that can be used to retrieve specific types of data. The URL form of the request is

https://rest.kegg.jp/<operation>/<argument>[/<argument2[/<argument3> ...]]

To get the KEGG pathway gene sets, we will use the following two operators:

link: get the mapping between two sources.
list: get the details of a list of items.

The link operator returns the mapping between two sources of information. We can use the following command to get the mappings between genes and pathways for human.

df1 = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
head(df1)

##          V1            V2
## 1 hsa:10327 path:hsa00010
## 2   hsa:124 path:hsa00010
## 3   hsa:125 path:hsa00010
## 4   hsa:126 path:hsa00010
## 5   hsa:127 path:hsa00010
## 6   hsa:128 path:hsa00010

In the example, url() constructs a connection object that directly transfer data from the remote URL. You can also first download the output from KEGG into a local file, read it and later delete the temporary file. url() basically automates such steps.

temp = tempfile()
download.file("https://rest.kegg.jp/link/pathway/hsa", destfile = temp)
df1 = read.table(temp, sep = "\t")
file.remove(temp)

Also the URL "https://rest.kegg.jp/link/hsa/pathway" returns identical results which only switches the two columns in the table.

In the output, the first column contains Entrez ID (users may remove the "hsa:" prefix for downstream analysis) and the second columncontains KEGG pathways IDs (users may remove the "path:" previx).

df1[, 1] = gsub("hsa:", "", df1[, 1])
df1[, 2] = gsub("path:",  "", df1[, 2])
head(df1)

##      V1       V2
## 1 10327 hsa00010
## 2   124 hsa00010
## 3   125 hsa00010
## 4   126 hsa00010
## 5   127 hsa00010
## 6   128 hsa00010

To get the full name of pathways, use the list command:

df2 = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
head(df2)

##         V1                                                     V2
## 1 hsa01100              Metabolic pathways - Homo sapiens (human)
## 2 hsa01200               Carbon metabolism - Homo sapiens (human)
## 3 hsa01210 2-Oxocarboxylic acid metabolism - Homo sapiens (human)
## 4 hsa01212           Fatty acid metabolism - Homo sapiens (human)
## 5 hsa01230     Biosynthesis of amino acids - Homo sapiens (human)
## 6 hsa01232           Nucleotide metabolism - Homo sapiens (human)

df2[, 2] = gsub(" - Homo.*$", "", df2[, 2])
head(df2)

##         V1                              V2
## 1 hsa01100              Metabolic pathways
## 2 hsa01200               Carbon metabolism
## 3 hsa01210 2-Oxocarboxylic acid metabolism
## 4 hsa01212           Fatty acid metabolism
## 5 hsa01230     Biosynthesis of amino acids
## 6 hsa01232           Nucleotide metabolism

KEGGREST

There is a KEGGREST package which implements a full interface to access KEGG data in R. All the APIs from KEGG REST service are suppoted in KEGGREST. The names of functions in KEGGREST are consistent to the operators in the KEGG API. For example, to get the mapping between genes and pathways, the function keggLink() can be used.

library(KEGGREST)
gene2pathway = keggLink("pathway", "hsa")
head(gene2pathway)

##       hsa:10327         hsa:124         hsa:125         hsa:126         hsa:127 
## "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" "path:hsa00010" 
##         hsa:128 
## "path:hsa00010"

This named vector is not commonly used for downstream gene set analysis. A more used format is a data frame. We can simply converted gene2pathway to a data frame.

df3 = data.frame(
    gene_id    = gsub("hsa:", "", names(gene2pathway)),
    pathway_id = gsub("path:", "", gene2pathway)
)
head(df3)

##   gene_id pathway_id
## 1   10327   hsa00010
## 2     124   hsa00010
## 3     125   hsa00010
## 4     126   hsa00010
## 5     127   hsa00010
## 6     128   hsa00010

Similar as the list operator in KEGG API, keggList() function can be used to retrieve pathway IDs and their names. The function returns a named vector.

head(keggList("pathway", "hsa"))

##                                                 hsa01100 
##              "Metabolic pathways - Homo sapiens (human)" 
##                                                 hsa01200 
##               "Carbon metabolism - Homo sapiens (human)" 
##                                                 hsa01210 
## "2-Oxocarboxylic acid metabolism - Homo sapiens (human)" 
##                                                 hsa01212 
##           "Fatty acid metabolism - Homo sapiens (human)" 
##                                                 hsa01230 
##     "Biosynthesis of amino acids - Homo sapiens (human)" 
##                                                 hsa01232 
##           "Nucleotide metabolism - Homo sapiens (human)"

Highlight genes on the pathway image

Users may want to highlight some genes on the pathway image, e.g. to highlight differentially expressed genes. The function color.pathway.by.objects() in KEGGREST accepts a vector of genes and assigns them with a set of colors. In the following example, I randomly pick 10 genes from the pathway “hsa04911”. It does not work any more because KEGG has changed its API.

Let’s manually color genes for a specific pathway on KEGG web site.

Open https://www.genome.jp/pathway/hsa04911.

genes = names(gene2pathway[gene2pathway == "path:hsa04911"])
diff_genes = sample(genes, 10)
settings = data.frame(
    genes = diff_genes, 
    colors = paste0(rep("orange", 10),",",rep("blue", 10)) # background,labels
)
write.table(settings, file = stdout(), row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")

hsa:468 orange,blue
hsa:23439   orange,blue
hsa:779 orange,blue
hsa:5582    orange,blue
hsa:477 orange,blue
hsa:2767    orange,blue
hsa:6514    orange,blue
hsa:775 orange,blue
hsa:6616    orange,blue
hsa:109 orange,blue

How to set gene colors: https://www.genome.jp/kegg/webapp/color_gui.html

The package ggkegg also supports customizing KEGG graph.

clusterProfiler

The second Bioconductor pacakge clusterProfiler has a simple function download_KEGG() which accepts the prefix of a organism and returns a list of two data frames, one for the mapping between genes and pathways and the other for the full name of pathways.

lt = clusterProfiler::download_KEGG("hsa")
head(lt$KEGGPATHID2EXTID)

##       from    to
## 1 hsa00010 10327
## 2 hsa00010   124
## 3 hsa00010   125
## 4 hsa00010   126
## 5 hsa00010   127
## 6 hsa00010   128

head(lt$KEGGPATHID2NAME)

##       from                              to
## 1 hsa01100              Metabolic pathways
## 2 hsa01200               Carbon metabolism
## 3 hsa01210 2-Oxocarboxylic acid metabolism
## 4 hsa01212           Fatty acid metabolism
## 5 hsa01230     Biosynthesis of amino acids
## 6 hsa01232           Nucleotide metabolism

Note: all the methods we introduced can not be used to retrieve the network data from KEGG.

Practice

Practice 1

In the previous example, we obtained the mapping between genes and KEGG pathways for human, using the organism code of "hsa". Since human is very well-studied and everyone knows "hsa" is for human, what if you are studying dogs? Try to find out the organism code for dog and get the gene-pathway mappings for dog.

Hint: use the list operator in KEGG API and search for dog.

Solution

You can get organism codes for all organisms by https://rest.kegg.jp/list/organism. Then you will find the code for dog is "cfa". Use one of the following three ways to download the data.

df = read.table(url("https://rest.kegg.jp/link/pathway/cfa"), sep = "\t")
pathway2gene = KEGGREST::keggLink("pathway", "cfa")
lt = clusterProfiler::download_KEGG("cfa")

Practice 2

Make the distribution of the numbers of genes in KEGG pathways (for human).

There is a pathway that contains far more genes than all other pathways. What is the pathway ID of it and what is its full name? Go to the web page of this largest pathway to see its real structure.

Solution

df1 = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
n_gene = table(df1[, 2])
hist(n_gene)

Making the intervals smaller is better:

hist(n_gene, nc = 100)

The largest pathway:

which.max(n_gene)

## path:hsa01100 
##            86

Its name:

df2 = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
df2[df2[, 1] == "hsa01100", ]

##         V1                                        V2
## 1 hsa01100 Metabolic pathways - Homo sapiens (human)

And its url on KEGG: https://www.genome.jp/pathway/hsa01100.

Practice 3

Highlight the following genes “5527”, “7534”, “2033”, “1111”, “5885” on the pathway image of “Cell cycle”. Choose colors you like.

Solution

df2 = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
df2[, 2] = gsub(" - Homo .*$", "", df2[, 2])
pathway_id = df2[df2[, 2] == "Cell cycle", 1]
pathway_id

## [1] "hsa04110"

Prepare the color table:

5527    red,blue
7534    blue,red
2033    yello,darkgreen
1111    grey,red
5885    purple,blue

Then go to https://www.genome.jp/pathway/hsa04110 and paste the color table there.

Zuguang Gu z.gu@dkfz.de

2025-03-02

Academic use of KEGG

Non-academic use of KEGG

Directly read from the KEGG API

KEGGREST

Highlight genes on the pathway image

clusterProfiler

Practice

Practice 1

Solution

Practice 2

Solution

Practice 3

Solution