vignettes/topic5_01_similarity.Rmd
topic5_01_similarity.Rmd
There are two types of methods to calculate similarities between functional terms.
We demonstrate it with significant terms from an ORA analysis.
library(GSEAtraining)
lt = readRDS(system.file("extdata", "ora.rds", package = "GSEAtraining"))
diff_gene = lt$diff_gene
diff_gene = convert_to_entrez_id(diff_gene)
##
## gene id might be SYMBOL (p = 0.980 )
## 'select()' returned 1:many mapping between keys and columns
First we use a small collection of gene sets, the KEGG pathways.
df = read.table(url("https://rest.kegg.jp/link/pathway/hsa"), sep = "\t")
df[, 1] = gsub("hsa:", "", df[, 1])
df[, 2] = gsub("path:", "", df[, 2])
head(df)
## V1 V2
## 1 10327 hsa00010
## 2 124 hsa00010
## 3 125 hsa00010
## 4 126 hsa00010
## 5 127 hsa00010
## 6 128 hsa00010
We also retrieve the pathway names.
pathway_names = read.table(url("https://rest.kegg.jp/list/pathway/hsa"), sep = "\t")
pathway_names = structure(gsub(" - Homo .*$", "", pathway_names[, 2]), names = pathway_names[, 1])
Perform ORA analysis with clusterProfiler. There are 33 significant pathways.
##
## Registered S3 methods overwritten by 'treeio':
## method from
## MRCA.phylo tidytree
## MRCA.treedata tidytree
## Nnode.treedata tidytree
## Ntip.treedata tidytree
## ancestor.phylo tidytree
## ancestor.treedata tidytree
## child.phylo tidytree
## child.treedata tidytree
## full_join.phylo tidytree
## full_join.treedata tidytree
## groupClade.phylo tidytree
## groupClade.treedata tidytree
## groupOTU.phylo tidytree
## groupOTU.treedata tidytree
## inner_join.phylo tidytree
## inner_join.treedata tidytree
## is.rooted.treedata tidytree
## nodeid.phylo tidytree
## nodeid.treedata tidytree
## nodelab.phylo tidytree
## nodelab.treedata tidytree
## offspring.phylo tidytree
## offspring.treedata tidytree
## parent.phylo tidytree
## parent.treedata tidytree
## root.treedata tidytree
## rootnode.phylo tidytree
## sibling.phylo tidytree
## clusterProfiler v4.8.3 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
##
## If you use clusterProfiler in published research, please cite:
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
##
## Attaching package: 'clusterProfiler'
## The following object is masked from 'package:stats':
##
## filter
tb = enricher(diff_gene, TERM2GENE = df[, 2:1])
sig_pathways = as.data.frame(tb)$ID
gs_pathway = split(df[, 1], df[, 2])
gs_pathway = gs_pathway[sig_pathways]
There are four similarity methods which are based on the counts:
Since Dice coeffcient is just a simple transformation from the Jaccard coeffcient, we only look at the following three methods:
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: grid
## ========================================
## simplifyEnrichment version 1.11.1
## Bioconductor page: https://bioconductor.org/packages/simplifyEnrichment/
## Github page: https://github.com/jokergoo/simplifyEnrichment
## Documentation: https://jokergoo.github.io/simplifyEnrichment/
## Examples: https://simplifyenrichment.github.io/
##
## If you use it in published research, please cite:
## Gu, Z. simplifyEnrichment: an R/Bioconductor package for Clustering and
## Visualizing Functional Enrichment Results, Genomics, Proteomics &
## Bioinformatics 2022.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(simplifyEnrichment))
## ========================================
m1 = term_similarity(gs_pathway, method = "jaccard")
m2 = term_similarity(gs_pathway, method = "overlap")
m3 = term_similarity(gs_pathway, method = "kappa")
Values on diagonals of the three symmetric matrices are 1. Since other values in the matrices are much smaller than 1, we set the diagonals to NA to improve the heatmap visualization.
diag(m1) = NA
diag(m2) = NA
diag(m3) = NA
library(ComplexHeatmap)
## ========================================
## ComplexHeatmap version 2.18.0
## Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
## Github page: https://github.com/jokergoo/ComplexHeatmap
## Documentation: http://jokergoo.github.io/ComplexHeatmap-reference
##
## If you use it in published research, please cite either one:
## - Gu, Z. Complex Heatmap Visualization. iMeta 2022.
## - Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional
## genomic data. Bioinformatics 2016.
##
##
## The new InteractiveComplexHeatmap package can directly export static
## complex heatmaps into an interactive Shiny app with zero effort. Have a try!
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(ComplexHeatmap))
## ========================================
Heatmap(m1, name = "jaccard") +
rowAnnotation(pathway_name = anno_text(pathway_names[rownames(m1)]))
Heatmap(m2, name = "overlap") +
rowAnnotation(pathway_name = anno_text(pathway_names[rownames(m1)]))
Heatmap(m3, name = "kappa") +
rowAnnotation(pathway_name = anno_text(pathway_names[rownames(m1)]))
In previous examples, we use all genes in the gene sets. Next we restrict only to the DE genes.
m1 = term_similarity(gs_pathway, method = "jaccard", all = diff_gene)
m2 = term_similarity(gs_pathway, method = "overlap", all = diff_gene)
m3 = term_similarity(gs_pathway, method = "kappa", all = diff_gene)
diag(m1) = NA
diag(m2) = NA
diag(m3) = NA
Heatmap(m1, name = "jaccard") +
rowAnnotation(pathway_name = anno_text(pathway_names[rownames(m1)]))
## The automatically generated colors map from the 1^st and 99^th of the
## values in the matrix. There are outliers in the matrix whose patterns
## might be hidden by this color mapping. You can manually set the color
## to `col` argument.
##
## Use `suppressMessages()` to turn off this message.
Heatmap(m2, name = "overlap") +
rowAnnotation(pathway_name = anno_text(pathway_names[rownames(m1)]))
Heatmap(m3, name = "kappa") +
rowAnnotation(pathway_name = anno_text(pathway_names[rownames(m1)]))
Next we use a much larger gene set collection: the GO gene sets:
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: IRanges
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.3.2
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:clusterProfiler':
##
## rename
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
##
## Attaching package: 'IRanges'
## The following object is masked from 'package:clusterProfiler':
##
## slice
##
## Attaching package: 'AnnotationDbi'
## The following object is masked from 'package:clusterProfiler':
##
## select
tb = enrichGO(gene = diff_gene, ont = "BP", OrgDb = org.Hs.eg.db)
sig_go = as.data.frame(tb)$ID
There are > 800 significant GO BP terms.
gs_bp = get_GO_gene_sets_from_orgdb(org.Hs.eg.db)
## 'select()' returned 1:many mapping between keys and columns
gs_bp = gs_bp[sig_go]
m1 = term_similarity(gs_bp, method = "jaccard")
m2 = term_similarity(gs_bp, method = "overlap")
m3 = term_similarity(gs_bp, method = "kappa")
diag(m1) = NA
diag(m2) = NA
diag(m3) = NA
Heatmap(m1, name = "jaccard", show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE)
## The automatically generated colors map from the 1^st and 99^th of the
## values in the matrix. There are outliers in the matrix whose patterns
## might be hidden by this color mapping. You can manually set the color
## to `col` argument.
##
## Use `suppressMessages()` to turn off this message.
Heatmap(m2, name = "overlap", show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE)
Heatmap(m3, name = "kappa", show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE)
## The automatically generated colors map from the 1^st and 99^th of the
## values in the matrix. There are outliers in the matrix whose patterns
## might be hidden by this color mapping. You can manually set the color
## to `col` argument.
##
## Use `suppressMessages()` to turn off this message.
Or only restrict in DE genes
m1 = term_similarity(gs_bp, method = "jaccard", all = diff_gene)
m2 = term_similarity(gs_bp, method = "overlap", all = diff_gene)
m3 = term_similarity(gs_bp, method = "kappa", all = diff_gene)
diag(m1) = NA
diag(m2) = NA
diag(m3) = NA
Heatmap(m1, name = "jaccard", show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE)
## The automatically generated colors map from the 1^st and 99^th of the
## values in the matrix. There are outliers in the matrix whose patterns
## might be hidden by this color mapping. You can manually set the color
## to `col` argument.
##
## Use `suppressMessages()` to turn off this message.
Heatmap(m2, name = "overlap", show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE)
Heatmap(m3, name = "kappa", show_row_names = FALSE, show_column_names = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE)
## The automatically generated colors map from the 1^st and 99^th of the
## values in the matrix. There are outliers in the matrix whose patterns
## might be hidden by this color mapping. You can manually set the color
## to `col` argument.
##
## Use `suppressMessages()` to turn off this message.
There are also two major types of methods for semantic similarities.
dag = create_ontology_DAG_from_GO_db("BP", org_db = org.Hs.eg.db)
The Sim_Lin_1998 is one of the classic semantic similarity method.
m = term_sim(dag, sig_go, method = "Sim_Lin_1998")
## remove 16 terms with no annotation.
dag_circular_viz(dag, highlight = sig_go)