simona supports many methods for calculating semantic similarities. In this document, we compare different term similarity methods using Gene Ontology as the test ontology.
By default, we use the Biological Process (BP) namespace in GO and keep only "is_a" and "part_of" as the relation types. The org_db argument is set to the human annotation package so that annotation-based methods can also be used.
library(simona)
dag = create_ontology_DAG_from_GO_db(org_db = "org.Hs.eg.db")
dag
## An ontology_DAG object:
## Source: GO BP / GO.db package 3.19.1
## 27186 terms / 54178 relations
## Root: GO:0008150
## Terms: GO:0000001, GO:0000002, GO:0000003, GO:0000011, ...
## Max depth: 18
## Avg number of parents: 1.99
## Avg number of children: 1.87
## Aspect ratio: 356.46:1 (based on the longest distance from root)
## 756.89:1 (based on the shortest distance from root)
## Relations: is_a, part_of
## Annotations: 18888 items
## 291, 1890, 4205, 4358, ...
##
## With the following columns in the metadata data frame:
## id, name, definition
All term similarity methods supported in simona are listed as follows. The full description of these methods can be found in the vignettes of simona.
all_term_sim_methods()
## [1] "Sim_Lin_1998" "Sim_Resnik_1999" "Sim_FaITH_2010"
## [4] "Sim_Relevance_2006" "Sim_SimIC_2010" "Sim_XGraSM_2013"
## [7] "Sim_EISI_2015" "Sim_AIC_2014" "Sim_Zhang_2006"
## [10] "Sim_universal" "Sim_Wang_2007" "Sim_GOGO_2018"
## [13] "Sim_Rada_1989" "Sim_Resnik_edge_2005" "Sim_Leocock_1998"
## [16] "Sim_WP_1994" "Sim_Slimani_2006" "Sim_Shenoy_2012"
## [19] "Sim_Pekar_2002" "Sim_Stojanovic_2001" "Sim_Wang_edge_2012"
## [22] "Sim_Zhong_2002" "Sim_AlMubaid_2006" "Sim_Li_2003"
## [25] "Sim_RSS_2013" "Sim_HRSS_2013" "Sim_Shen_2010"
## [28] "Sim_SSDD_2013" "Sim_Jiang_1997" "Sim_Kappa"
## [31] "Sim_Jaccard" "Sim_Dice" "Sim_Overlap"
## [34] "Sim_Ancestor"
We compare all supported term similarity methods on 500 random GO terms. Because annotation-based methods are included in the comparison, the 500 terms are sampled only from those that have gene annotations, i.e. those with a non-missing annotation-based IC.
set.seed(123)
ic = term_IC(dag, method = "IC_annotation")
ic = ic[!is.na(ic)]
go_id = sample(names(ic), 500)
We calculate similarities with different term similarity methods and save the results in a list.
lt = lapply(all_term_sim_methods(), function(method) {
term_sim(dag, go_id, method)
})
names(lt) = all_term_sim_methods()
For comparison, we take the lower triangle of each similarity matrix and merge the similarity values from all methods into a data frame, with one column per method and one row per term pair.
df = as.data.frame(lapply(lt, function(x) x[lower.tri(x)]))
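As a small illustration of this flattening step, the sketch below applies the same expression to two made-up 3 x 3 similarity matrices. The names `toy_lt`, `method_A`, `method_B` and the values are hypothetical and only serve to show the shape of the result.

```r
# Toy symmetric similarity matrices for three hypothetical terms
terms = c("t1", "t2", "t3")
m1 = matrix(c(1,   0.2, 0.5,
              0.2, 1,   0.8,
              0.5, 0.8, 1), nrow = 3, dimnames = list(terms, terms))
m2 = matrix(c(1,   0.1, 0.4,
              0.1, 1,   0.9,
              0.4, 0.9, 1), nrow = 3, dimnames = list(terms, terms))
toy_lt = list(method_A = m1, method_B = m2)

# lower.tri() excludes the diagonal by default, so each of the
# choose(3, 2) = 3 term pairs is kept exactly once per method
toy_df = as.data.frame(lapply(toy_lt, function(x) x[lower.tri(x)]))
toy_df
```

Because every matrix is indexed with the same logical mask, row i of the data frame refers to the same term pair in every column, which is what makes the column-wise correlation meaningful.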
We calculate the correlations between the similarity vectors from different methods and draw the correlation heatmap. Since 1 - correlation is also a valid dissimilarity measure, we generate the hierarchical clustering directly from 1 - cor.
cor = cor(df, use = "pairwise.complete.obs")
hc = hclust(as.dist(1 - cor))
library(ComplexHeatmap)
Heatmap(cor, name = "correlation", cluster_rows = hc, cluster_columns = hc,
row_dend_reorder = TRUE, column_dend_reorder = TRUE,
column_title = "Semantic similarity correlation, GO BP")
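To see why clustering on 1 - correlation groups methods with similar behaviour, here is a minimal sketch on simulated data. The three "methods" and their values are invented for illustration: `method_B` is a noisy copy of `method_A`, while `method_C` is unrelated to both.

```r
set.seed(1)
base = runif(50)
toy_df = data.frame(
    method_A = base,
    method_B = base + rnorm(50, sd = 0.01),  # nearly identical to method_A
    method_C = runif(50)                     # unrelated to the other two
)
# With use = "pairwise.complete.obs", rows with NA are dropped per pair of columns
toy_df$method_B[1:3] = NA

toy_cor = cor(toy_df, use = "pairwise.complete.obs")

# Correlation 1 maps to distance 0, correlation -1 maps to distance 2
toy_dist = as.dist(1 - toy_cor)
toy_hc = hclust(toy_dist)

# Cut the tree into two clusters
toy_groups = cutree(toy_hc, k = 2)
toy_groups
```

In this toy case `method_A` and `method_B` fall into one cluster and `method_C` into the other, because the first merge always joins the pair with the smallest 1 - correlation.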
Next, we remove the following similarity methods, which show very different patterns from the other methods, and redraw the heatmap.
ind = which(colnames(df) %in% c("Sim_Jiang_1997", "Sim_HRSS_2013", "Sim_universal",
"Sim_Dice", "Sim_Kappa", "Sim_Jaccard", "Sim_Overlap"))
cor2 = cor[-ind, -ind]
df2 = df[, -ind]
hc = hclust(as.dist(1 - cor2))
Heatmap(cor2, name = "correlation", cluster_rows = hc, cluster_columns = hc,
row_dend_reorder = TRUE, column_dend_reorder = TRUE,
column_title = "Semantic similarity correlation, GO BP")
We can observe that the remaining similarity methods fall into five groups. We manually add group labels to these methods:
group = c("Sim_Shen_2010" = 1,
"Sim_Zhang_2006" = 1,
"Sim_EISI_2015" = 1,
"Sim_XGraSM_2013" = 1,
"Sim_Resnik_1999" = 1,
"Sim_Lin_1998" = 1,
"Sim_FaITH_2010" = 1,
"Sim_Relevance_2006" = 1,
"Sim_SimIC_2010" = 1,
"Sim_SSDD_2013" = 2,
"Sim_RSS_2013" = 2,
"Sim_Zhong_2002" = 2,
"Sim_Slimani_2006" = 2,
"Sim_Pekar_2002" = 3,
"Sim_WP_1994" = 3,
"Sim_Shenoy_2012" = 3,
"Sim_Stojanovic_2001" = 3,
"Sim_Li_2003" = 3,
"Sim_Wang_edge_2012" = 3,
"Sim_Wang_2007" = 4,
"Sim_Ancestor" = 4,
"Sim_AIC_2014" = 4,
"Sim_GOGO_2018" = 4,
"Sim_AlMubaid_2006" = 5,
"Sim_Leocock_1998" = 5,
"Sim_Rada_1989" = 5,
"Sim_Resnik_edge_2005" = 5)
Next, we perform MDS (multidimensional scaling) on the 1 - cor distance matrix and visualize the first two dimensions.
library(ggrepel)
library(ggplot2)
loc = cmdscale(as.dist(1-cor2))
loc = as.data.frame(loc)
colnames(loc) = c("x", "y")
loc$method = rownames(loc)
loc$group = group[rownames(loc)]
ggplot(loc, aes(x, y, label = method, col = factor(group))) +
geom_point() +
geom_text_repel(show.legend = FALSE, size = 3) +
labs(x = "Dimension 1", y = "Dimension 2", col = "Group") +
ggtitle("MDS on similarity methods on random 500 GO terms")
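As a rough sketch of what `cmdscale()` does, the example below starts from four hypothetical points with known 2D coordinates, passes only their pairwise distances to `cmdscale()`, and checks that the recovered coordinates reproduce those distances. The point names and coordinates are invented. With a real 1 - cor matrix the two-dimensional layout is generally only an approximation of the input distances.

```r
# Hypothetical 2D positions of four methods
toy_coords = matrix(c(0.0, 0.00,
                      0.1, 0.05,
                      0.8, 0.30,
                      0.9, 0.20),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(c("m1", "m2", "m3", "m4"), NULL))
toy_d = dist(toy_coords)  # Euclidean distances between the points

# Classical MDS; k = 2 dimensions is the default
toy_loc = as.data.frame(cmdscale(toy_d, k = 2))
colnames(toy_loc) = c("x", "y")

# The embedding is only defined up to rotation and reflection,
# so we compare distances rather than coordinates
max_err = max(abs(dist(toy_loc) - toy_d))
max_err
```

Since the axes of an MDS plot have no intrinsic orientation, only the relative positions of the methods should be interpreted.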
The following are pairwise scatterplots of semantic similarities from two term similarity methods. The five groups are taken from the heatmap in Figure S4.2. Each point corresponds to a term pair, and the color of the point corresponds to the depth of the pair's LCA term. Note that for the methods based on MICA terms, the majority of LCA terms and MICA terms are the same, as shown in the document “Compare topology and annotation-based semantic similarity methods”. So in this comparison, we can take \(\mathrm{LCA} \approx \mathrm{MICA}\).
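To make the LCA depth used for coloring concrete, here is a minimal base-R sketch on a made-up six-term DAG. It assumes that depth is the longest distance from the root and that the LCA of two terms is their common ancestor with the largest depth. The helpers `ancestors()`, `depth()` and `lca()` are hypothetical and are not functions of simona.

```r
# Toy DAG given as a child -> parents list (all names are made up)
parents = list(
    root = character(0),
    a = "root",
    b = "root",
    c = c("a", "b"),
    d = "a",
    e = c("c", "d")
)

# A term together with all terms reachable through its parent links
ancestors = function(term) {
    out = term
    for (p in parents[[term]]) out = union(out, ancestors(p))
    out
}

# Depth, assumed here to be the longest distance from the root
depth = function(term) {
    ps = parents[[term]]
    if (length(ps) == 0) return(0)
    1 + max(sapply(ps, depth))
}

# LCA, assumed here to be the deepest common ancestor
lca = function(t1, t2) {
    common = intersect(ancestors(t1), ancestors(t2))
    d = sapply(common, depth)
    names(d)[which.max(d)]
}

lca("c", "d")         # "a"
depth(lca("c", "d"))  # 1
lca("e", "d")         # "d", since a term counts as its own ancestor here
```

A MICA term would be selected the same way, except that the common ancestors would be ranked by information content instead of depth.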
The methods in the five groups are summarized as follows:
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-apple-darwin13.4.0
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS/LAPACK: /Users/guz/opt/miniconda3/envs/R-4.4.1/lib/libopenblasp-r0.3.27.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## time zone: Europe/Berlin
## tzcode source: system (macOS)
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GetoptLong_1.0.5 ggrepel_0.9.5 ggplot2_3.5.1
## [4] ComplexHeatmap_2.20.0 simona_1.3.12 knitr_1.48
## [7] rmarkdown_2.28
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 farver_2.1.2 dplyr_1.1.4
## [4] blob_1.2.4 Biostrings_2.72.1 fastmap_1.2.0
## [7] promises_1.3.0 digest_0.6.37 mime_0.12
## [10] lifecycle_1.0.4 cluster_2.1.6 KEGGREST_1.44.1
## [13] RSQLite_2.3.7 magrittr_2.0.3 compiler_4.4.1
## [16] rlang_1.1.4 sass_0.4.9 tools_4.4.1
## [19] utf8_1.2.4 igraph_2.0.3 yaml_2.3.10
## [22] labeling_0.4.3 bit_4.0.5 scatterplot3d_0.3-44
## [25] xml2_1.3.6 RColorBrewer_1.1-3 withr_3.0.1
## [28] BiocGenerics_0.50.0 stats4_4.4.1 fansi_1.0.6
## [31] xtable_1.8-4 colorspace_2.1-1 GO.db_3.19.1
## [34] scales_1.3.0 iterators_1.0.14 cli_3.6.3
## [37] crayon_1.5.3 generics_0.1.3 ragg_1.3.2
## [40] httr_1.4.7 rjson_0.2.22 DBI_1.2.3
## [43] cachem_1.1.0 zlibbioc_1.50.0 parallel_4.4.1
## [46] AnnotationDbi_1.66.0 XVector_0.44.0 proxyC_0.4.1
## [49] matrixStats_1.3.0 vctrs_0.6.5 Matrix_1.7-0
## [52] jsonlite_1.8.8 IRanges_2.38.1 S4Vectors_0.42.1
## [55] bit64_4.0.5 clue_0.3-65 systemfonts_1.1.0
## [58] foreach_1.5.2 jquerylib_0.1.4 glue_1.7.0
## [61] codetools_0.2-20 Polychrome_1.5.1 shape_1.4.6.1
## [64] gtable_0.3.5 later_1.3.2 GenomeInfoDb_1.40.1
## [67] UCSC.utils_1.0.0 munsell_0.5.1 tibble_3.2.1
## [70] pillar_1.9.0 htmltools_0.5.8.1 GenomeInfoDbData_1.2.12
## [73] circlize_0.4.16 R6_2.5.1 textshaping_0.4.0
## [76] doParallel_1.0.17 evaluate_0.24.0 shiny_1.9.1
## [79] Biobase_2.64.0 lattice_0.22-6 highr_0.11
## [82] png_0.1-8 memoise_2.0.1 httpuv_1.6.15
## [85] bslib_0.8.0 Rcpp_1.0.13 org.Hs.eg.db_3.19.1
## [88] xfun_0.47 pkgconfig_2.0.3 GlobalOptions_0.1.2