Topic 1-04: Reactome gene sets

Reactome is another popular pathway database. It organizes pathways in a hierarchical manner, which contains pathways and sub pathways or pathway components. The up-to-date pathway data can be direclty found at https://reactome.org/download-data (Mapping Files -> Identifier mapping files -> NCBI to All pathways)

reactome.db

There is a reactome.db package on Bioconductor. But note it is not an official Bioconductor data package.

The basic information of the data:

library(reactome.db)
reactome.db

## ReactomeDb object:
## | DBSCHEMA: REACTOME_DB
## | DBSCHEMAVERSION: 89
## | SOURCENAME: Reactome
## | SOURCEURL: https://reactome.org/download/current/databases/gk_current.sql.gz
## | SOURCEDATE: 19997
## | Supporting package: AnnotationDbi
## | Db type: ReactomeDb

The “SOURCEDATE” might be the version of the data.

In reactome.db, the important objects are

reactomePATHID2EXTID contains mappings between reactome pathway IDs and gene EntreZ IDs
reactomePATHID2NAME contains pathway names

library(reactome.db)
tb = toTable(reactomePATHID2EXTID)
head(tb)

##           DB_ID gene_id
## 1  R-HSA-109582       1
## 2  R-HSA-114608       1
## 3  R-HSA-168249       1
## 4  R-HSA-168256       1
## 5 R-HSA-6798695       1
## 6   R-HSA-76002       1

p2n = toTable(reactomePATHID2NAME)
head(p2n)

##           DB_ID
## 1 R-BTA-1971475
## 2 R-BTA-1369062
## 3  R-BTA-382556
## 4 R-BTA-9033807
## 5  R-BTA-418592
## 6  R-BTA-392170
##                                                                     path_name
## 1 Bos taurus: A tetrasaccharide linker sequence is required for GAG synthesis
## 2                           Bos taurus: ABC transporters in lipid homeostasis
## 3                          Bos taurus: ABC-family proteins mediated transport
## 4                                    Bos taurus: ABO blood group biosynthesis
## 5                       Bos taurus: ADP signalling through P2Y purinoceptor 1
## 6                      Bos taurus: ADP signalling through P2Y purinoceptor 12

In the previous code, we use the function toTable() to retrieve the data as a data frame. You can also try as.list() on the two objects and compare the output.

Reactome contains pathway for multiple organisms. In the reactome ID, the second section contains the organism, e.g. in previous output HSA.

sort(table( gsub("^R-(\\w+)-\\d+$", "\\1", p2n[, 1]) ))

## 
##  MTU  PFA  SCE  SPO  DDI  CEL  DRE  XTR  DME  CFA  SSC  BTA  RNO  GGA  MMU  HSA 
##   13  641  855  872 1035 1366 1374 1520 1528 1714 1734 1742 1750 1762 1762 2711

barplot(sort(table( gsub("^R-(\\w+)-\\d+$", "\\1", p2n[, 1]) )))

Directly get data from the database

If you are not sure whether reactome.db provides the most up-to-date data, we can directly process the original data fro reactome database.

download.file("https://reactome.org/download/current/NCBI2Reactome_All_Levels.txt", destfile = "NCBI2Reactome_All_Levels.txt")
tb = read.table("NCBI2Reactome_All_Levels.txt", sep = "\t", comment.char = "")
head(tb)

##   V1            V2                                                  V3
## 1  1  R-HSA-109582  https://reactome.org/PathwayBrowser/#/R-HSA-109582
## 2  1  R-HSA-114608  https://reactome.org/PathwayBrowser/#/R-HSA-114608
## 3  1  R-HSA-168249  https://reactome.org/PathwayBrowser/#/R-HSA-168249
## 4  1  R-HSA-168256  https://reactome.org/PathwayBrowser/#/R-HSA-168256
## 5  1 R-HSA-6798695 https://reactome.org/PathwayBrowser/#/R-HSA-6798695
## 6  1   R-HSA-76002   https://reactome.org/PathwayBrowser/#/R-HSA-76002
##                                               V4  V5           V6
## 1                                     Hemostasis TAS Homo sapiens
## 2                        Platelet degranulation  TAS Homo sapiens
## 3                           Innate Immune System TAS Homo sapiens
## 4                                  Immune System TAS Homo sapiens
## 5                       Neutrophil degranulation TAS Homo sapiens
## 6 Platelet activation, signaling and aggregation TAS Homo sapiens

Restrict to human pathways. Note do not use column 6 for filtering.

tb2 = tb[grepl("HSA", tb[, 2]), ]

The gene sets

gs = split(tb2[, 1], tb2[, 2])

It is better to ensure Entrez IDs are stored as characters and genes are unique

gs = lapply(gs, function(x) unique(as.character(x)))

Pathway names:

p2n = unique(tb[, c(2, 4)])

Use simona to manually construct gene sets

The file we used “NCBI2Reactome_All_Levels.txt” has the description
“All levels of the pathway hierarchy” from the reactome website. Note pathways are organized in a hierarchical way (an ontology) where if a gene is annotated to a pathway, it should be annotated to all its ancestor pathways. For most gene annotations on ontologies (such as GO), genes are only annotated to the lowest terms in the ontology. We ask, in “NCBI2Reactome_All_Levels.txt”, are all genes correctly assigned to all ancestor pathways?

Let’s manually create an ontology object and merge gene annotations.

Read the hierarchical relations of pathways. Note the first column contains parent pathways and the second column contains child pathways:

library(simona)
df = read.table("https://reactome.org/download/current/ReactomePathwaysRelation.txt", sep = "\t")
df = df[grepl("HSA", df[, 1]) & grepl("HSA", df[, 2]), ]
head(df)

##                 V1            V2
## 10760 R-HSA-109581  R-HSA-109606
## 10761 R-HSA-109581  R-HSA-169911
## 10762 R-HSA-109581 R-HSA-5357769
## 10763 R-HSA-109581   R-HSA-75153
## 10764 R-HSA-109582  R-HSA-140877
## 10765 R-HSA-109582  R-HSA-202733

Create an ontology object with simona and also add gene annotations.

dag = create_ontology_DAG(df[, 1], df[, 2], annotation = gs)
dag

## An ontology_DAG object:
##   Source: Ontology 
##   2770 terms / 2816 relations
##   Root: ~~all~~ 
##   Terms: R-HSA-1059683, R-HSA-109581, R-HSA-109582, R-HSA-109606,
##          ...
##   Max depth: 12 
##   Avg number of parents: 1.02
##   Avg number of children: 1.03
##   Aspect ratio: 110.5:1 (based on the longest distance from root)
##                 112:1 (based on the shortest distance from root)
##   Annotations: 5365 items
##                3569, 3570, 3572, 3716, ...

term_annotations() automatically merges genes for ancestor pathways:

gs3 = term_annotations(dag, setdiff(dag_all_terms(dag), "~~all~~"))
gs3 = gs3[sapply(gs3, length) > 0]

For this method, GSEAtopics has a helper function load_reactome_genesets() that wraps the code above:

lt = load_reactome_genesets("HSA")
lt$name  # pathway names
lt$gs    # pathway gene sets

Compare the three methods

Now we have three different ways to get Reactome pathway gene sets, let’s compare their consistency.

Let’s first rename the three types of gene sets objects.

gs_web = gs
gs1 = as.list(reactomePATHID2EXTID)
gs_pkg = gs1[grepl("HSA", names(gs1))]
gs_ontology = gs3

The overlap:

library(eulerr)
plot(euler(list(gs_pkg = names(gs_pkg), 
                gs_web = names(gs_web),
                gs_ontology = names(gs_ontology))), quantities = TRUE)

gs_web and gs_ontology have the same set of pathways.

setequal(names(gs_web), names(gs_ontology))

## [1] TRUE

And we compare the numbers of genes in pathways:

cn = intersect(intersect(names(gs_pkg), names(gs_web)), names(gs_ontology))
n1 = sapply(gs_pkg[cn], length)
n2 = sapply(gs_web[cn], length)
n3 = sapply(gs_ontology[cn], length)

par(mfrow = c(1, 2))
plot(n1, n2, log = "xy", xlab = "reactome.db", ylab = "website")
plot(n2, n3, log = "xy", xlab = "website", ylab = "ontology")

Conclusion: if you use reactome pathways, use method 2 or 3.

Practice

Practice 1

Make the distribution of the numbers of genes in Reactome pathways (use human).

Print the names of pathways with numbers of genes > 2000.

library(GSEAtopics)
lt = load_reactome_genesets("HSA")
n_gene = sapply(lt$gs, length)
hist(n_gene)

Make the intervals smaller:

hist(n_gene, nc = 100)

Pathways more than 2000 genes:

n_gene[n_gene > 2000]

## R-HSA-1430728  R-HSA-162582 R-HSA-1643685  R-HSA-168256  R-HSA-392499 
##          2178          2597          2084          2062          2103

Their names:

lt$name[ names(n_gene[n_gene > 2000]) ]

##            R-HSA-1430728             R-HSA-162582            R-HSA-1643685 
##             "Metabolism"    "Signal Transduction"                "Disease" 
##             R-HSA-168256             R-HSA-392499 
##          "Immune System" "Metabolism of proteins"

If you go to https://reactome.org/PathwayBrowser/, these big pathways correspond to the pathway clusters on the top level.