The methods of semantic similarity implemented in simona are mainly from the supplementary file of the paper “Mazandu et al., Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2017”. Original denotations have been slightly modified to make them more consistent. Also more explanations have been added in this vignette.

Denotations

The following denotations will be used throughout the vignette. The denotations are mainly from Mazandu 2017 only with small modifications.

| Denotation | Description |
|---|---|
| $r$ | The root term of the DAG. In simona there is always one root term. |
| $\delta(x)$ | The depth of a term $x$ in the DAG, which is the longest distance from root $r$. |
| $\delta_s(x)$ | The length of the longest path from root $r$ to a term $x$ via term $s$. |
| $\delta_\max$ | The maximal depth in the DAG. |
| $\eta(x)$ | The height of term $x$ in the DAG, which is the longest finite distance to leaf terms. |
| $\mathcal{C}_s$ | The set of child terms of term $s$. |
| $\mathcal{P}_s$ | The set of parent terms of term $s$. |
| $\mathcal{A}_s$ | The set of ancestor terms of term $s$. |
| $\mathcal{A}_s^+$ | The set of ancestor terms of term $s$, including $s$ itself. |
| $\mathcal{D}_s$ | The set of offspring terms of term $s$. |
| $\mathcal{D}_s^+$ | The set of offspring terms of term $s$, including $s$ itself. |
| $\mathcal{L}_s$ | The set of leaf terms that term $s$ can reach. |
| $\lvert A \rvert$ | Number of elements in set $A$. |
| $D_\mathrm{sp}(a, b)$ | The shortest distance between $a$ and $b$. |
| $\mathrm{len}(a, b)$ | The longest distance between $a$ and $b$. |
| $\mathrm{len}_s(a, b)$ | The length of the longest path from $a$ to $b$ via $s$. |
| $\mathrm{CA}(a, b)$ | The set of common ancestors of terms $a$ and $b$, i.e. $\mathrm{CA}(a, b) = \mathcal{A}_a^+ \cap \mathcal{A}_b^+$. |
| $\mathrm{LCA}(a, b)$ | Lowest common ancestor of $a$ and $b$, which is the common ancestor with the largest depth in the DAG, i.e. $\operatorname*{argmax}_{t \in \mathrm{CA}(a, b)} \delta(t)$. There might be more than one LCA term for two given terms; to simplify the calculation, the one with the longest distance (the default) to $a$ and $b$ is used. |
| $\mathrm{NCA}(a, b)$ | Nearest common ancestor of $a$ and $b$, i.e. $\operatorname*{argmin}_{t \in \mathrm{CA}(a, b)} \left( D_\mathrm{sp}(t, a) + D_\mathrm{sp}(t, b) \right)$. If there is more than one NCA term, the one with the largest depth (the lowest one) is used. |
| $\mathrm{MICA}(a, b)$ | Most informative common ancestor of $a$ and $b$, i.e. $\operatorname*{argmax}_{t \in \mathrm{CA}(a, b)} \left( \mathrm{IC}(t) \right)$. There might be more than one MICA term for two given terms; the one with the longest distance (the default) to $a$ and $b$ is used. |
| $G_s$ | The set of annotated items on term $s$. |

Assume term $a$ is an ancestor of term $b$. Then $D_\mathrm{sp}(a, b)$ (the order of $a$ and $b$ does not matter) is the normal shortest distance from $a$ to $b$ in a directed graph. The definition is similar for $\mathrm{len}(a, b)$.

If terms $a$ and $b$ are not in an offspring/ancestor relationship, i.e. $a$ is not an ancestor of $b$, and $b$ is not an ancestor of $a$, then

$$ \begin{align*} D_\mathrm{sp}(a, b) &= \min_{t \in \mathrm{CA}(a, b)} \left( D_\mathrm{sp}(t, a) + D_\mathrm{sp}(t, b) \right) \\ \mathrm{len}(a, b) &= \max_{t \in \mathrm{CA}(a, b)} \left( \mathrm{len}(t, a) + \mathrm{len}(t, b) \right) \end{align*} $$
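As a small numeric illustration of these two definitions, the following base R sketch uses made-up distances from two hypothetical common ancestors (`r` and `p`) to terms `a` and `b`. The values and variable names are assumptions for illustration only and do not come from a real ontology or from simona.

```r
# Hypothetical shortest distances from each common ancestor t to a and to b
sp_to_a <- c(r = 3, p = 1)
sp_to_b <- c(r = 2, p = 2)
# Hypothetical longest distances from each common ancestor t to a and to b
len_to_a <- c(r = 4, p = 2)
len_to_b <- c(r = 3, p = 2)

# D_sp(a, b): minimum over common ancestors of the summed shortest distances
d_sp <- min(sp_to_a + sp_to_b)
# len(a, b): maximum over common ancestors of the summed longest distances
len_ab <- max(len_to_a + len_to_b)
```

In this toy setting the shortest distance goes through `p` while the longest goes through `r`, which shows that the two quantities can be attained at different common ancestors.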

General

The wrapper function term_sim() calculates semantic similarities between terms in the DAG with a specific method. Note that the method name can be partially matched. The control argument sets parameters for specific methods.

term_sim(dag, terms, method = ..., control = list(...))

All supported term similarity methods are:

##  [1] "Sim_Lin_1998"         "Sim_Resnik_1999"      "Sim_FaITH_2010"      
##  [4] "Sim_Relevance_2006"   "Sim_SimIC_2010"       "Sim_XGraSM_2013"     
##  [7] "Sim_EISI_2015"        "Sim_AIC_2014"         "Sim_Zhang_2006"      
## [10] "Sim_universal"        "Sim_Wang_2007"        "Sim_GOGO_2018"       
## [13] "Sim_Rada_1989"        "Sim_Resnik_edge_2005" "Sim_Leocock_1998"    
## [16] "Sim_WP_1994"          "Sim_Slimani_2006"     "Sim_Shenoy_2012"     
## [19] "Sim_Pekar_2002"       "Sim_Stojanovic_2001"  "Sim_Wang_edge_2012"  
## [22] "Sim_Zhong_2002"       "Sim_AlMubaid_2006"    "Sim_Li_2003"         
## [25] "Sim_RSS_2013"         "Sim_HRSS_2013"        "Sim_Shen_2010"       
## [28] "Sim_SSDD_2013"        "Sim_Jiang_1997"       "Sim_Kappa"           
## [31] "Sim_Jaccard"          "Sim_Dice"             "Sim_Overlap"         
## [34] "Sim_Ancestor"

IC-based or node-based methods

Methods of this type consider a special common ancestor term $c$ of terms $a$ and $b$, which has the highest IC among all common ancestors of $a$ and $b$. Term $c$ is called the most informative common ancestor (MICA) and is given by:

$$ \mathrm{IC}(c) = \max_{t \in \mathcal{A}_a^+ \cap \mathcal{A}_b^+} \mathrm{IC}(t) $$

So if two terms are identical, MICA is the term itself, and if two terms have ancestor/offspring relationship, MICA is the ancestor term.
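The MICA lookup can be sketched in base R on a toy example. The ancestor sets and IC values below are hypothetical and only serve to illustrate the definition; this is not how simona computes it internally.

```r
# Hypothetical ancestor sets (each including the term itself)
anc_plus <- list(
  a = c("r", "p", "q", "a"),
  b = c("r", "p", "b")
)
# Hypothetical IC values for every term
ic <- c(r = 0, p = 1.2, q = 2.0, a = 3.1, b = 2.7)

# Common ancestors CA(a, b) and the one with the highest IC
common <- intersect(anc_plus$a, anc_plus$b)
mica <- common[which.max(ic[common])]
```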

In the following sections, unless otherwise stated, $c$ always refers to the MICA of $a$ and $b$.

Sim_Lin_1998

The similarity is calculated as the IC of the MICA term $c$ normalized by the average of the IC of the two terms:

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}(c)}{(\mathrm{IC}(a) + \mathrm{IC}(b))/2} = \frac{2 * \mathrm{IC}(c)}{\mathrm{IC}(a) + \mathrm{IC}(b)} $$

term_sim(dag, terms, method = "Sim_Lin_1998")
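To make the formula concrete, here is a minimal base R sketch with hypothetical IC values for $a$, $b$ and their MICA $c$ (the numbers are assumptions, not results from simona):

```r
# Hypothetical IC values
ic_a <- 3
ic_b <- 5
ic_c <- 2   # IC of the MICA term

# Lin similarity: IC of the MICA divided by the mean IC of the two terms
sim_lin <- 2 * ic_c / (ic_a + ic_b)
```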

Paper link: https://dl.acm.org/doi/10.5555/645527.657297.

Sim_Resnik_1999

The IC of the MICA term itself, $\mathrm{IC}(c)$, can be a measure of how similar two terms are, but its range is not [0, 1]. There are several ways to normalize $\mathrm{IC}(c)$ to the range [0, 1]. Note that some of the normalization methods are restricted to IC_annotation as the IC method.

Nunif

It is normalized to the possible maximal IC value where a term only has one item annotated.

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}(c)}{-\log(1/N)} = \frac{\mathrm{IC}(c)}{\log N} $$

where $N$ is the total number of items annotated to the whole DAG.

Nmax

It is similar to Nunif, but normalized to the maximal IC of all terms in the DAG. If there is a term with only one item annotated, Nmax is identical to the Nunif method.

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}(c)}{\mathrm{IC}_\mathrm{max}} $$

Nunivers

$\mathrm{IC}(c)$ is normalized by the larger of the ICs of terms $a$ and $b$.

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}(c)}{\max \{ \mathrm{IC}(a), \mathrm{IC}(b) \}} $$

Paper link: https://doi.org/10.1613/jair.514, https://doi.org/10.1186/1471-2105-9-S5-S4, https://doi.org/10.1186/1471-2105-11-562, https://doi.org/10.1155/2013/292063.

The normalization method can be set with the norm_method parameter:

term_sim(dag, terms, method = "Sim_Resnik_1999", control = list(norm_method = "Nunif"))
term_sim(dag, terms, method = "Sim_Resnik_1999", control = list(norm_method = "Nmax"))
term_sim(dag, terms, method = "Sim_Resnik_1999", control = list(norm_method = "Nunivers"))
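The three normalizations can be compared side by side in base R. The IC values, the total number of annotated items `N` and the maximal IC `ic_max` below are hypothetical and chosen only so that `ic_max` does not exceed `log(N)`.

```r
# Hypothetical inputs
ic_a <- 3
ic_b <- 5
ic_c <- 2       # IC of the MICA term
N <- 1000       # assumed total number of items annotated to the DAG
ic_max <- 6     # assumed maximal IC among all terms

sim_nunif    <- ic_c / log(N)            # Nunif
sim_nmax     <- ic_c / ic_max            # Nmax
sim_nunivers <- ic_c / max(ic_a, ic_b)   # Nunivers
```

Because `ic_max` can never exceed `log(N)` for annotation-based IC, Nmax is at least as large as Nunif for the same pair of terms.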

Sim_FaITH_2010

It is calculated as:

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}(c)}{\mathrm{IC}(a) + \mathrm{IC}(b) - \mathrm{IC}(c)} $$

The relation between the FaITH_2010 similarity and Lin_1998 similarity is:

$$ \mathrm{Sim}_\mathrm{FaITH} = \frac{\mathrm{Sim}_\mathrm{Lin}}{2 - \mathrm{Sim}_\mathrm{Lin}} $$

term_sim(dag, terms, method = "Sim_FaITH_2010")
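The relation between the two similarities can be checked numerically. This base R sketch uses hypothetical IC values and is only meant to illustrate the algebra:

```r
# Hypothetical IC values
ic_a <- 3
ic_b <- 5
ic_c <- 2   # IC of the MICA term

sim_faith <- ic_c / (ic_a + ic_b - ic_c)
sim_lin   <- 2 * ic_c / (ic_a + ic_b)

# FaITH expressed through Lin
sim_faith_from_lin <- sim_lin / (2 - sim_lin)
```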

Paper link: https://doi.org/10.1007/978-3-642-17746-0_39.

Sim_Relevance_2006

If Lin_1998 is thought of as a measure of how close terms $a$ and $b$ are to their MICA term $c$, the Relevance method corrects it by multiplying by a factor that accounts for the specificity of $c$. The factor is $1-p(c)$, where $p(c) = k/N$ is the annotation-based probability, $k$ is the number of items annotated to $c$ and $N$ is the total number of items annotated to the DAG. Under the Relevance method, the corrected IC of $c$ is:

$$ \mathrm{IC}_\mathrm{corrected}(c) = (1-p(c)) * \mathrm{IC}(c) $$

If using Lin_1998 as the similarity method, the corrected Relevance similarity is:

$$ \begin{align*} \mathrm{Sim}(a, b) & = \frac{2*\mathrm{IC}_\mathrm{corrected}(c)}{\mathrm{IC}(a) + \mathrm{IC}(b)} \\ & = (1-p(c)) * \frac{2 * \mathrm{IC}(c)}{\mathrm{IC}(a) + \mathrm{IC}(b)} \\ & = (1-p(c)) * \mathrm{Sim}_\mathrm{Lin}(a, b) \end{align*} $$

The term $p(c)$ requires that terms are annotated to items. However, since $p(c) = \exp(-\mathrm{IC}(c))$ for the annotation-based IC, it can be extended to more general scenarios:

$$ \mathrm{IC}_\mathrm{corrected}(c) = \left(1 - \exp(-\mathrm{IC}(c))\right) * \mathrm{IC}(c) $$

term_sim(dag, terms, method = "Sim_Relevance_2006")
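A short base R sketch of the Relevance correction, assuming annotation-based IC with the natural logarithm, i.e. $\mathrm{IC}(c) = -\log(p(c))$. All counts and IC values are hypothetical.

```r
# Hypothetical annotation counts
k <- 50      # items annotated to the MICA term c
N <- 1000    # items annotated to the whole DAG
p_c <- k / N
ic_c <- -log(p_c)   # annotation-based IC of c (assumed natural log)

# Hypothetical ICs of the two terms (both at least as large as ic_c)
ic_a <- 5
ic_b <- 6

sim_lin <- 2 * ic_c / (ic_a + ic_b)
sim_relevance <- (1 - p_c) * sim_lin

# The generalized factor 1 - exp(-IC(c)) recovers 1 - p(c)
general_factor <- 1 - exp(-ic_c)
```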

Paper link: https://doi.org/10.1186/1471-2105-7-302

Sim_SimIC_2010

The SimIC_2010 method is an improved correction of the Relevance method, because the latter works badly when $p(c)$ is very small. For example, when $1-p(c)$ is used as a correction factor, it cannot clearly distinguish $p(c) = 0.01$ from $p(c) = 0.001$, because in both cases $1 - p(c)$ is very close to 1.

With the SimIC_2010 correction factor, the corrected IC of the MICA term $c$ is:

$$ \mathrm{IC}_\mathrm{corrected}(c) = \left( 1 - \frac{1}{1 - \log(p(c))} \right) * \mathrm{IC}(c) $$

Then the similarity (if using Lin_1998 as the original similarity method) is:

$$ \mathrm{Sim}(a, b) = \left( 1 - \frac{1}{1 - \log(p(c))} \right) * \mathrm{Sim}_\mathrm{Lin}(a, b) $$

Similarly, since $-\log(p(c)) = \mathrm{IC}(c)$, it can be generalized to:

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}(c)}{1 + \mathrm{IC}(c)} * \mathrm{Sim}_\mathrm{Lin}(a, b) $$

term_sim(dag, terms, method = "Sim_SimIC_2010")
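The difference between the two correction factors is easy to see numerically. The sketch below, in base R, evaluates both factors at the two probabilities mentioned in this section and assumes natural-log IC, $\mathrm{IC} = -\log(p)$.

```r
p <- c(0.01, 0.001)

# Relevance factor: both values are nearly 1
relevance_factor <- 1 - p

# SimIC factor: spreads the two cases further apart
simic_factor <- 1 - 1 / (1 - log(p))

# Equivalent generalized form IC / (1 + IC), with IC = -log(p)
ic <- -log(p)
simic_factor_general <- ic / (1 + ic)
```

The gap between the two SimIC factors is larger than the gap between the two Relevance factors, which is the motivation given for the method.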

Paper link: https://doi.org/10.48550/arXiv.1001.0958.

Sim_XGraSM_2013

Unlike the Relevance and SimIC_2010 methods, which only use the IC of the MICA term, the XGraSM_2013 method and the next method use the mean IC of a subset of the common ancestor terms of $a$ and $b$. This subset has different names in different methods.

XGraSM_2013 is the simplest one, which uses the informative common ancestors (ICA), i.e. the common ancestors with positive IC.

$$ \mathrm{ICA}(a, b) = \{c \in \mathcal{A}_a^+ \cap \mathcal{A}_b^+: \mathrm{IC}(c) > 0\} $$

The mean IC among all ICA terms is:

$$ \mathrm{IC}_\mathrm{mean} = \frac{1}{|\mathrm{ICA}(a, b)|} \sum_{t \in \mathrm{ICA}(a, b)} \mathrm{IC}(t) $$

Applying the Lin_1998 method, the semantic similarity is:

$$ \mathrm{Sim}(a, b) = 2 * \frac{\mathrm{IC}_\mathrm{mean}}{\mathrm{IC}(a) + \mathrm{IC}(b)} $$

term_sim(dag, terms, method = "Sim_XGraSM_2013")
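The following base R sketch walks through the XGraSM_2013 steps on a toy example. The IC values and the set of common ancestors are hypothetical.

```r
# Hypothetical IC values
ic <- c(r = 0, p = 1, q = 2, a = 3, b = 5)
# Hypothetical common ancestors of a and b
common <- c("r", "p", "q")

# Informative common ancestors: those with positive IC (the root is dropped)
ica <- common[ic[common] > 0]
ic_mean <- mean(ic[ica])

# Lin-style similarity using the mean IC
sim_xgrasm <- 2 * ic_mean / (ic[["a"]] + ic[["b"]])
```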

Paper link: https://doi.org/10.1186/1471-2105-14-284

Sim_EISI_2015

It selects a specific subset of common ancestors of terms $a$ and $b$. It only selects a common ancestor $c$ which can reach $a$ or $b$ via one of its child terms that does not belong to the common ancestors (i.e. the child is an ancestor of exactly one of $a$ and $b$). The set of the selected common ancestors is called the exclusively inherited common ancestors (EICA).

$$ \mathrm{EICA}(a, b) = \{c \in \mathcal{A}_a \cap \mathcal{A}_b: \mathcal{C}_c \cap \left( (\mathcal{A}_a \cup \mathcal{A}_b) - (\mathcal{A}_a \cap \mathcal{A}_b) \right) \neq \emptyset \} $$

The mean IC among all EICA terms is:

$$ \mathrm{IC}_\mathrm{mean} = \frac{1}{|\mathrm{EICA}(a, b)|} \sum_{t \in \mathrm{EICA}(a, b)} \mathrm{IC}(t) $$

Applying the Lin_1998 method, the semantic similarity is:

$$ \mathrm{Sim}(a, b) = 2 * \frac{\mathrm{IC}_\mathrm{mean}}{\mathrm{IC}(a) + \mathrm{IC}(b)} $$

term_sim(dag, terms, method = "Sim_EISI_2015")

Paper link: https://doi.org/10.1016/j.gene.2014.12.062

Sim_AIC_2014

It uses the aggregate information content from ancestors. First define the semantic weight, denoted as $S_w$, of a term $t$ in the DAG:

$$ S_w(t) = \frac{1}{1 + \exp \left(-\frac{1}{\mathrm{IC}(t)} \right)} $$

Then the similarity is calculated as the ratio of the aggregation over common ancestors to the average aggregation over the ancestors of $a$ and $b$ separately.

$$ \mathrm{Sim}(a, b) = \frac{2*\sum\limits_{t \in \mathcal{A}_a^+ \cap \mathcal{A}_b^+} S_w(t) }{ \sum\limits_{t \in \mathcal{A}_a^+} S_w(t) + \sum\limits_{t \in \mathcal{A}_b^+} S_w(t) } $$

term_sim(dag, terms, method = "Sim_AIC_2014")
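A minimal base R sketch of the aggregation, using hypothetical ancestor sets and IC values. The helper name `semantic_weight` is introduced here for illustration and is not a simona function.

```r
# Semantic weight S_w(t) as defined above; for IC = 0, exp(-Inf) is 0 in R, so S_w = 1
semantic_weight <- function(ic) 1 / (1 + exp(-1 / ic))

# Hypothetical IC values and ancestor sets (each including the term itself)
ic <- c(r = 0, p = 1, q = 2, a = 3, b = 5)
anc_a <- c("r", "p", "q", "a")
anc_b <- c("r", "p", "b")
common <- intersect(anc_a, anc_b)

sim_aic <- 2 * sum(semantic_weight(ic[common])) /
  (sum(semantic_weight(ic[anc_a])) + sum(semantic_weight(ic[anc_b])))

# Identical terms share all ancestors, so the similarity is exactly 1
sim_self <- 2 * sum(semantic_weight(ic[anc_a])) /
  (2 * sum(semantic_weight(ic[anc_a])))
```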

Paper link: https://doi.org/10.1109/tcbb.2013.176.

Sim_Zhang_2006

It uses the IC_Zhang_2006 IC method and uses Lin_1998 similarity method to calculate similarities:

$$ \mathrm{Sim}(a, b) = \frac{2*\mathrm{IC}_\mathrm{Zhang}(c)}{\mathrm{IC}_\mathrm{Zhang}(a) + \mathrm{IC}_\mathrm{Zhang}(b)} $$

term_sim(dag, terms, method = "Sim_Zhang_2006")

Sim_universal

It uses the IC_universal IC method and uses the Nunivers method to calculate similarities:

$$ \mathrm{Sim}(a, b) = \frac{\mathrm{IC}_\mathrm{Univers}(c)}{\max \{ \mathrm{IC}_\mathrm{Univers}(a), \mathrm{IC}_\mathrm{Univers}(b) \}} $$

term_sim(dag, terms, method = "Sim_universal")

Sim_Wang_2007

Similar to the Sim_AIC_2014 method, it also aggregates over ancestors, but it uses the “S-value” introduced in the IC_Wang_2007 section in 4. Information content.

$$ \mathrm{Sim}(a, b) = \frac{\sum\limits_{t \in \mathcal{A}_a^+ \cap \mathcal{A}_b^+} (S_a(t) + S_b(t)) }{ \sum\limits_{t \in \mathcal{A}_a^+} S_a(t) + \sum\limits_{t \in \mathcal{A}_b^+} S_b(t) } $$

The contribution of different semantic relations can be set with the contribution_factor parameter. The value should be a named numeric vector where names should cover the relations defined in relations set in create_ontology_DAG(). For example, if there are two relations “relation_a” and “relation_b” set in the DAG, the value for contribution_factor can be set as:

term_sim(dag, terms, method = "Sim_Wang_2007", 
    control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))

By default 0.8 is set for “is_a” and 0.6 for “part_of”.

If you are not sure what types of relations have been set, simply type the dag object. The relation types will be printed there.

Paper link: https://doi.org/10.1093/bioinformatics/btm087.

Sim_GOGO_2018

It is very similar to Sim_Wang_2007, except that there is a correction for the contribution factor. When calculating the “S-value” introduced in the IC_Wang_2007 section in 4. Information content, for a parent and a child, the weight variable $w_e$ is directly determined by the relation type, e.g. 0.8 for “is_a”. In Sim_GOGO_2018, the number of child terms is also considered for $w_e$:

$$ w_e = \frac{1}{c + |\mathcal{C}_t|} + w_0 $$

where $|\mathcal{C}_t|$ is the number of child terms of the parent $t$, and $w_0$ is the original contribution factor directly assigned for each relation type. $c$ is selected to ensure $w_e \leq 1$ (assuming the minimal number of children is 1), which is normally:

$$ c = \frac{\max \{w_0\}}{1 - \max \{w_0\}} $$

By default, 0.4 is assigned for “is_a” and 0.3 is assigned for “part_of”, so $c$ is set to 2/3 (the solution of $1 = 1/(c + 1) + 0.4$).

term_sim(dag, terms, method = "Sim_GOGO_2018", 
    control = list(contribution_factor = c("relation_a" = 0.4, "relation_b" = 0.3)))
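The constant $c$ and the adjusted weight $w_e$ can be reproduced with a few lines of base R. The function name `edge_weight` and the variable `c_const` are hypothetical helpers for this illustration, not part of simona.

```r
# Default contribution factors stated above
w0 <- c(is_a = 0.4, part_of = 0.3)

# Constant chosen so that w_e <= 1 when a parent has a single child
c_const <- max(w0) / (1 - max(w0))

# Adjusted weight for an edge whose parent has n_children child terms
edge_weight <- function(n_children, w) 1 / (c_const + n_children) + w

w_one_child   <- edge_weight(1, w0[["is_a"]])   # upper bound case
w_four_childs <- edge_weight(4, w0[["is_a"]])   # more children, smaller weight
```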

Paper link: https://doi.org/10.1038/s41598-018-33219-y.

Sim_Ancestor

This is a Jaccard-like coefficient:

$$ \mathrm{Sim}(a, b) = \frac{\left| \mathcal{A}^+_a \cap \mathcal{A}^+_b \right|}{\left| \mathcal{A}^+_a \cup \mathcal{A}^+_b \right|} $$

term_sim(dag, terms, method = "Sim_Ancestor")
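On hypothetical ancestor sets, the coefficient reduces to two set operations in base R:

```r
# Hypothetical ancestor sets (each including the term itself)
anc_a <- c("r", "p", "q", "a")
anc_b <- c("r", "p", "b")

# Shared ancestors divided by all ancestors of either term
sim_ancestor <- length(intersect(anc_a, anc_b)) / length(union(anc_a, anc_b))
```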

Edge-based methods

Methods introduced in this section rely on the distance between terms. Many methods were originally defined on the shortest distance between two terms. This section extends them to also support the longest distance via the LCA term.

Sim_Rada_1989

It is based on the distance between terms $a$ and $b$. It is defined as:

$$ \mathrm{Sim}(a, b) = \frac{1}{1 + D_\mathrm{sp}(a, b)} $$

which is based on the shortest distance between $a$ and $b$. Optionally, the distance can also be the longest distance via the LCA term $c$.

$$ \mathrm{Sim}(a, b) = \frac{1}{1 + \mathrm{len}_c(a, b)} = \frac{1}{1 + \mathrm{len}(c, a) + \mathrm{len}(c, b)} $$

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_Rada_1989",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: https://doi.org/10.1109/21.24528.

Sim_Resnik_edge_2005

It is a normalized distance:

$$ \mathrm{Sim}(a, b) = 1 - \frac{D_\mathrm{sp}(a, b)}{2*\delta_\mathrm{max}} $$

where $2*\delta_\mathrm{max}$ can be thought of as the possible maximal distance between two terms in the DAG.

Similarly, the distance can also be the longest distance via the LCA, which is consistent with the definition of $\delta_\mathrm{max}$ since both are based on the longest distance.

$$ \mathrm{Sim}(a, b) = 1 - \frac{\mathrm{len}_c(a, b)}{2*\delta_\mathrm{max}} $$

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_Resnik_edge_2005",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: https://doi.org/10.1145/1097047.1097051.

Sim_Leocock_1998

It is similar to the Sim_Resnik_edge_2005 method, but it applies a log-transformation to the distance and the depth:

$$ \mathrm{Sim}(a, b) = 1 - \frac{\log(D_\mathrm{sp}(a, b))}{\log(2*\delta_\mathrm{max})} $$

where $2*\delta_\mathrm{max}$ can be thought of as the possible maximal distance between two terms in the DAG.

Similarly, the distance can also be the longest distance via the LCA, which is consistent with the definition of $\delta_\mathrm{max}$ since both are based on the longest distance.

$$ \mathrm{Sim}(a, b) = 1 - \frac{\log(\mathrm{len}_c(a, b))}{\log(2*\delta_\mathrm{max})} $$

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_Leocock_1998",
    control = list(distance = "shortest_distances_via_NCA"))
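To compare the three distance-only formulas seen so far (Rada, Resnik edge and the log-transformed variant), the base R sketch below applies them to the same hypothetical distance and maximal depth. The numbers are assumptions chosen for illustration.

```r
# Hypothetical inputs
d <- 4            # distance between a and b (shortest or longest via LCA)
depth_max <- 10   # maximal depth of the DAG

sim_rada        <- 1 / (1 + d)
sim_resnik_edge <- 1 - d / (2 * depth_max)
sim_log         <- 1 - log(d) / log(2 * depth_max)
```

Note that the raw log-transformed formula is not finite at `d = 0`, because `log(0)` is `-Inf` in R; this sketch does not show how simona treats that case.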

Paper link: https://ieeexplore.ieee.org/document/6287675.

Sim_WP_1994

It is based on the depth of the LCA term $c$ and the longest distance between terms $a$ and $b$ via $c$:

$$ \begin{align*} \mathrm{Sim}(a, b) & = \frac{2*\delta(c)}{\mathrm{len}(c, a) + \mathrm{len}(c, b) + 2*\delta(c)} \\ & = \frac{2*\delta(c)}{\mathrm{len}_c(a, b) + 2*\delta(c)} \end{align*} $$

And it can also be written in the Lin_1998 form:

$$ \begin{align*} \mathrm{Sim}(a, b) & = \frac{2*\delta(c)}{\delta(c) + \mathrm{len}(c, a) + \delta(c) + \mathrm{len}(c, b)} \\ & = \frac{2*\delta(c)}{\delta_c(a) + \delta_c(b)} \end{align*} $$

where the terms in the denominator are the depths of $a$ and $b$ via $c$.

term_sim(dag, terms, method = "Sim_WP_1994")

Paper link: https://doi.org/10.3115/981732.981751.

Sim_Slimani_2006

It is a correction of the Sim_WP_1994 method. With a correction factor $\mathrm{CF}(a, b)$ for terms $a$ and $b$ with respect to their LCA term $c$, the similarity is:

$$ \mathrm{Sim}(a, b) = \mathrm{CF}(a, b) * \mathrm{Sim}_\mathrm{WP}(a, b) $$

The correction factor $\mathrm{CF}(a, b)$ depends on whether $a$ and $b$ are in an ancestor/offspring relationship or not.

$$ \mathrm{CF}(a, b) = \left\{ \begin{array}{ll} \min\{ \delta(a), \delta(b)\} - \delta(c) = \min\{\mathrm{len}(c, a), \mathrm{len}(c, b)\} & \textit{a} \text{ and } \textit{b} \text{ are not ancestor-offspring} \\ \frac{1}{1 + |\delta(a) - \delta(b)|} = \frac{1}{1 + \mathrm{len}(a,b)} & \textit{a} \text{ and } \textit{b} \text{ are ancestor-offspring} \end{array} \right. $$

term_sim(dag, terms, method = "Sim_Slimani_2006")

Paper link: https://zenodo.org/record/1075130.

Sim_Shenoy_2012

It is also a correction of the Sim_WP_1994 method. The correction factor for terms $a$ and $b$ is:

$$ \mathrm{CF}(a, b) = \left\{ \begin{array}{ll} 1 & \textit{a} \text{ and } \textit{b} \text{ are not ancestor-offspring} \\ \exp\left(-\frac{D_\mathrm{sp}(a, b)}{\delta_\mathrm{max}}\right) & \textit{a} \text{ and } \textit{b} \text{ are ancestor-offspring} \end{array} \right. $$

$D_\mathrm{sp}(a, b)$ can be replaced with $\mathrm{len}(a, b)$ if the longest distance is used.

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_Shenoy_2012",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: https://doi.org/10.48550/arXiv.1211.4709.

Sim_Pekar_2002

It is very similar to the Sim_WP_1994 method:

$$ \begin{align*} \mathrm{Sim}(a, b) &= \frac{\delta(c)}{\mathrm{len}(c, a) + \mathrm{len}(c, b) + \delta(c)} \\ &= \frac{\delta(c)}{\delta(c) + \mathrm{len}(c, a) + \delta(c) + \mathrm{len}(c, b) - \delta(c)} \\ &= \frac{\delta(c)}{\delta_c(a) + \delta_c(b) - \delta(c)} \end{align*} $$

And the relationship to $\mathrm{Sim}_\mathrm{WP}$ is:

$$ \mathrm{Sim}_\mathrm{Pekar}(a, b) = \frac{\mathrm{Sim}_\mathrm{WP}(a, b)}{2 - \mathrm{Sim}_\mathrm{WP}(a, b)} $$

term_sim(dag, terms, method = "Sim_Pekar_2002")
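The relationship between the Pekar and WP similarities can be verified numerically. The depth and path lengths below are hypothetical.

```r
# Hypothetical inputs
depth_c <- 3   # depth of the LCA term c
len_ca  <- 2   # longest distance from c to a
len_cb  <- 1   # longest distance from c to b

sim_wp    <- 2 * depth_c / (len_ca + len_cb + 2 * depth_c)
sim_pekar <- depth_c / (len_ca + len_cb + depth_c)

# Pekar expressed through WP
sim_pekar_from_wp <- sim_wp / (2 - sim_wp)
```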

Paper link: https://aclanthology.org/C02-1090/.

Sim_Stojanovic_2001

It is purely based on the depths of terms $a$, $b$ and their LCA term $c$.

$$ \mathrm{Sim}(a, b) = \frac{\delta(c)}{\delta(a) + \delta(b) - \delta(c)} $$

The similarity value might be negative because there is no restriction that the path from root to $a$ or $b$ must pass $c$.

term_sim(dag, terms, method = "Sim_Stojanovic_2001")

Paper link: https://doi.org/10.1145/500737.500762.

Sim_Wang_edge_2012

It is calculated as:

$$ \begin{align*} \mathrm{Sim}(a, b) & = \frac{\mathrm{len}(r, c)^2}{\mathrm{len}_c(r, a)*\mathrm{len}_c(r, b)} \\ & = \frac{\delta(c)^2}{\delta_c(a)*\delta_c(b)} \end{align*} $$

where $r$ is the root term.

term_sim(dag, terms, method = "Sim_Wang_edge_2012")

Paper link: https://doi.org/10.1186/1477-5956-10-s1-s18.

Sim_Zhong_2002

For a term $x$, it first calculates a “mile-stone” value based on the depth as

$$ m(x) = 2^{-\delta(x) - 1} $$

Then the distance between terms $a$ and $b$ via the LCA term $c$ is:

$$ \begin{align*} D(a, b) & = D(c, a) + D(c, b) \\ & = m(c) - m(a) + m(c) - m(b) \\ & = 2^{-\delta(c)} - 2^{-\delta(a) - 1} - 2^{-\delta(b) - 1} \end{align*} $$

We can change the original $\delta(a)$ and $\delta(b)$ to $\delta_c(a)$ and $\delta_c(b)$ to require that the depth to reach $a$ and $b$ should go through $c$. Then the above equation becomes

$$ \begin{align*} D(a, b) & = 2^{-\delta(c)} - 2^{-\delta_c(a) - 1} - 2^{-\delta_c(b) - 1} \\ & = 2^{-\delta(c)} - 2^{-\delta(c)-\mathrm{len}(c,a)-1} - 2^{-\delta(c)-\mathrm{len}(c,b)-1} \\ & = 2^{-\delta(c)} \left( 1 - 2^{-\mathrm{len}(c,a)-1} - 2^{-\mathrm{len}(c,b)-1} \right) \end{align*} $$

Then when $a = b$ (the two terms are identical), $D(a, b) = 0$, and when $c = r$ (the root is the only common ancestor) and $\mathrm{len}(r, a) \to \infty$, $\mathrm{len}(r, b) \to \infty$ (the root is infinitely far from the terms), $D(a, b)$ approaches its maximum of 1. So the similarity

$$ \mathrm{Sim}(a, b) = 1 - D(a, b) $$

ranges between 0 and 1.

term_sim(dag, terms, method = "Sim_Zhong_2002")
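The mile-stone distance and its factored form can be checked with a short base R sketch. The depth and path lengths are hypothetical, and `milestone` is a helper defined here for illustration.

```r
# Mile-stone value m(x) = 2^(-depth - 1)
milestone <- function(depth) 2^(-depth - 1)

# Hypothetical inputs
depth_c <- 2   # depth of the LCA term c
len_ca  <- 1   # longest distance from c to a
len_cb  <- 3   # longest distance from c to b

# Distance via c, using depths of a and b measured through c
dist_ab <- 2 * milestone(depth_c) -
  milestone(depth_c + len_ca) - milestone(depth_c + len_cb)

# Same quantity in the factored form
dist_ab_factored <- 2^(-depth_c) * (1 - 2^(-len_ca - 1) - 2^(-len_cb - 1))

sim_zhong <- 1 - dist_ab
```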

Paper link: https://doi.org/10.1007/3-540-45483-7_8.

Sim_AlMubaid_2006

It also takes account of the distance between terms $a$ and $b$, as well as the depth of the LCA term $c$ in the DAG. The distance is calculated as:

$$ D(a, b) = \log(1 + D_\mathrm{sp}(a, b)*(\delta_\mathrm{max} - \delta(c))) $$

To scale $D(a, b)$ into the range [0, 1], note that its smallest value is zero, when $a = b$. $D(a, b)$ reaches its maximum when $D_\mathrm{sp}(a, b)$ reaches its possible maximum, which is $2*\delta_\mathrm{max}$, and $\delta(c) = 0$. Then we can define the maximum of $D(a, b)$ as

$$ D_\mathrm{max} = \log(1 + 2*\delta_\mathrm{max} * \delta_\mathrm{max}) $$

And the similarity is:

$$ \mathrm{Sim}(a, b) = 1 - D(a, b)/D_\mathrm{max} $$

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_AlMubaid_2006",
    control = list(distance = "shortest_distances_via_NCA"))
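A base R sketch of the scaled distance, with hypothetical values for the distance, the LCA depth and the maximal depth. The helper `almubaid_sim` is defined here only for illustration.

```r
# Similarity from a distance d, the depth of the LCA and the maximal depth
almubaid_sim <- function(d, depth_c, depth_max) {
  dist <- log(1 + d * (depth_max - depth_c))
  dist_max <- log(1 + 2 * depth_max * depth_max)
  1 - dist / dist_max
}

sim_typical   <- almubaid_sim(d = 4, depth_c = 3, depth_max = 10)
sim_identical <- almubaid_sim(d = 0, depth_c = 3, depth_max = 10)   # a = b
sim_extreme   <- almubaid_sim(d = 20, depth_c = 0, depth_max = 10)  # d = 2 * depth_max, LCA is root
```

The two boundary cases reproduce the scaling argument: identical terms give 1, and the maximal distance through the root gives 0.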

Paper link: https://doi.org/10.1109/IEMBS.2006.259235.

Sim_Li_2003

It is similar to the Sim_AlMubaid_2006 method, but uses a non-linear form:

$$ \mathrm{Sim}(a, b) = \exp(-0.2*D_\mathrm{sp}(a, b)) * \tanh(0.6*\delta(c)) $$

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_Li_2003",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: https://doi.org/10.1109/TKDE.2003.1209005.

Hybrid methods

Hybrid methods use both DAG structure information and IC.

Sim_RSS_2013

The similarity is adjusted by the positions of terms $a$, $b$ and the LCA term $c$ in the DAG. The similarity is defined as:

$$ \mathrm{Sim}(a, b) = \frac{\delta_\mathrm{max}}{\delta_\mathrm{max} + D_\mathrm{sp}(a, b)} * \frac{\alpha}{\alpha + \beta} $$

where $D_\mathrm{sp}(a, b)$ can also be the longest distance via the LCA. $\alpha$ and $\beta$ in the second factor are defined as:

$$ \begin{align*} \alpha & = \delta(c) \\ \beta & = \min\{ \eta(a), \eta(b) \} \end{align*} $$

where $\alpha$ is the depth of the LCA, and $\beta$ corresponds to the distance to leaves, which is the smaller of the heights of $a$ and $b$ in the DAG.

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

term_sim(dag, terms, method = "Sim_RSS_2013",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: https://doi.org/10.1371/journal.pone.0066745.

Sim_HRSS_2013

It is similar to the Sim_RSS_2013 method, but it uses information content instead of the distance to adjust the similarity.

It first defines the semantic distance between terms $a$ and $b$ as the sum of their distances to their MICA term $c$:

$$ D(a, b) = D(c, a) + D(c, b) $$

And the distance from an ancestor to a term is the difference of their ICs:

$$ \begin{align*} D(c, a) & = \mathrm{IC}(a) - \mathrm{IC}(c) \\ D(a, b) & = D(c, a) + D(c, b) = \mathrm{IC}(a) + \mathrm{IC}(b) - 2*\mathrm{IC}(c) \end{align*} $$

Similarly, the similarity is also corrected by the positions of the MICA term and $a$, $b$ in the DAG:

$$ \mathrm{Sim}(a, b) = \frac{1}{1 + D(a, b)} * \frac{\alpha}{\alpha + \beta} $$

where

$$ \alpha = \mathrm{IC}(c) $$

And $\beta$ is the average of the maximal semantic distances of $a$ and $b$ to leaves.

$$ \beta = \frac{D(a, l_a) + D(b, l_b)}{2} = \frac{\mathrm{IC}(l_a) - \mathrm{IC}(a) + \mathrm{IC}(l_b) - \mathrm{IC}(b)}{2} $$

where $l_a$ or $l_b$ is the leaf with the highest IC that $a$ or $b$ can reach (i.e. the most informative leaf):

$$ \mathrm{IC}(l_a) = \max_{z \in \mathcal{L}_a} \mathrm{IC}(z) $$

term_sim(dag, terms, method = "Sim_HRSS_2013")
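The pieces of the HRSS formula can be assembled in base R as follows. All IC values, including those of the most informative leaves, are hypothetical.

```r
# Hypothetical IC values
ic_a  <- 3
ic_b  <- 5
ic_c  <- 2   # MICA of a and b
ic_la <- 7   # most informative leaf reachable from a
ic_lb <- 6   # most informative leaf reachable from b

# IC-based distance between a and b via the MICA
dist_ab <- ic_a + ic_b - 2 * ic_c

# Position correction
alpha <- ic_c
beta  <- ((ic_la - ic_a) + (ic_lb - ic_b)) / 2

sim_hrss <- 1 / (1 + dist_ab) * alpha / (alpha + beta)
```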

Paper link: https://doi.org/10.1371/journal.pone.0066745.

Sim_Shen_2010

It is based on the information content of terms on the path connecting terms $a$ and $b$ via their MICA term $c$.

Consider the list of terms $a, ..., c, ..., b$ formed by the shortest path from $c$ to $a$ and from $c$ to $b$. The distance between $a$ and $b$ is the sum of $1/\mathrm{IC}$ of the terms on the path. Denoting $L_c(a, b)$ as the set of terms on the shortest path connecting $a$ and $b$ via the MICA term $c$, the similarity is:

$$ \mathrm{Sim}(a, b) = 1 - \frac{\arctan \left( \sum\limits_{x \in L_c(a, b)} \frac{1}{\mathrm{IC}(x)} \right)}{\pi/2} $$

The path $L_c(a, b)$ can also be defined as the longest path via MICA. The distance parameter controls which type of paths to use.

term_sim(dag, terms, method = "Sim_Shen_2010",
    control = list(distance = "longest_distances_via_LCA"))

Paper link: https://doi.org/10.1109/BIBM.2010.5706623.

Sim_SSDD_2013

It is similar to the Sim_Shen_2010 method, as it also sums information along the path passing through the LCA term. Instead of summing the information contents, the Sim_SSDD_2013 method sums up a so-called “T-value” which relies on the DAG structure.

Denoting $L_c(a, b)$ as the set of terms on the shortest path connecting $a$ and $b$ via the LCA term $c$, the similarity is calculated as:

$$ \mathrm{Sim}(a, b) = 1 - \frac{\arctan \left( \sum\limits_{x \in L_c(a, b)} T(x) \right) }{\pi/2} $$

The T-value $T(x)$ depends on the DAG structure and considers both parents and children of $x$. The definition of $T(x)$ is:

$$ T(x) = \left\{ \begin{array}{ll} 1 & \text{if }\textit{x}\text{ is a root} \\ \frac{1}{|\mathcal{P}_x|} \sum\limits_{t \in \mathcal{P}_x}(w * T(t)) & \text{otherwise} \end{array} \right. $$

which means the T-value of a term is an average of the weighted T-values of its parents. The weight $w$ measures the fraction of information that a parent $t$ transmits downstream in the DAG via $x$, defined as:

$$ w = \frac{|\mathcal{D}_x^+|}{|\mathcal{D}_t^+|} $$

$w \leq 1$ as all offspring of $x$ are also offspring of its parent $t$.

The path $L_c(a, b)$ can also be defined as the longest path via the LCA. The distance parameter controls which type of paths to use.

term_sim(dag, terms, method = "Sim_SSDD_2013",
    control = list(distance = "longest_distances_via_LCA"))

Paper link: https://doi.org/10.1016/j.ygeno.2013.04.010.

Sim_Jiang_1997

First, the semantic distance between terms $a$ and $b$ via the MICA term $c$ is defined as:

$$ D(a, b) = \mathrm{IC}(a) + \mathrm{IC}(b) - 2*\mathrm{IC}(c) $$

Then there are several normalization methods to change the distance to a similarity and to scale it into the range [0, 1].

  • "max": $1 - \frac{D(a, b)}{2*\mathrm{IC}_\mathrm{max}}$
  • "Couto": $1 - \min\{ 1, \frac{D(a, b)}{\mathrm{IC}_\mathrm{max}} \}$
  • "Lin": $1 - \frac{D(a, b)}{\mathrm{IC}(a) + \mathrm{IC}(b)}$ which is the same as the Sim_Lin_1998 method
  • "Garla": $1 - \frac{\log(D(a, b) + 1)}{\log(2*\mathrm{IC}_\mathrm{max} + 1)}$
  • "log-Lin": $1 - \frac{\log(D(a, b) + 1)}{\log(\mathrm{IC}(a) + \mathrm{IC}(b) + 1)}$
  • "Rada": $\frac{1}{1 + D(a, b)}$

The normalization method can be set via the parameter norm_method:

term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "max"))
term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "Couto"))
term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "Lin"))
term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "Garla"))
term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "log-Lin"))
term_sim(dag, terms, method = "Sim_Jiang_1997", control = list(norm_method = "Rada"))
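To make the normalizations concrete, the following base R sketch evaluates several of them for made-up IC values. It is an illustration only: `jiang_sim` is a hypothetical helper, not a simona function, the IC values are invented, and the sketch covers only the "max", "Lin", "Garla", "log-Lin" and "Rada" formulas.

```r
# Hypothetical helper: Jiang distance followed by one of the normalizations.
# ic_c is the IC of the MICA term of a and b; ic_max is the maximal IC in the DAG.
jiang_sim <- function(ic_a, ic_b, ic_c, ic_max, norm_method) {
    d <- ic_a + ic_b - 2 * ic_c
    switch(norm_method,
        "max"     = 1 - d / (2 * ic_max),
        "Lin"     = 1 - d / (ic_a + ic_b),
        "Garla"   = 1 - log(d + 1) / log(2 * ic_max + 1),
        "log-Lin" = 1 - log(d + 1) / log(ic_a + ic_b + 1),
        "Rada"    = 1 / (1 + d),
        stop("norm_method not covered by this sketch: ", norm_method)
    )
}

# Made-up IC values for illustration: IC(a) = 4, IC(b) = 6, IC(c) = 3, IC_max = 10.
methods <- c("max", "Lin", "Garla", "log-Lin", "Rada")
sims <- vapply(methods, function(m) jiang_sim(4, 6, 3, 10, m), numeric(1))

# The "Lin" normalization reduces to 2 * IC(c) / (IC(a) + IC(b)).
lin_direct <- 2 * 3 / (4 + 6)
```

With these values the distance is 4, so the same pair of terms gets noticeably different scores depending on the normalization, which is worth keeping in mind when comparing results across settings.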

Paper link: https://aclanthology.org/O97-1002/.

Annotation-count based methods

Denote $A$ and $B$ as the sets of items annotated to terms $a$ and $b$, and $U$ as the universe set of all items annotated to the DAG.

Sim_Kappa

The definition of the kappa coefficient is a little complex. First, let’s arrange the two sets into a contingency table:

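|               | In set B: Yes | In set B: No |
|---------------|---------------|--------------|
| In set A: Yes | $a$           | $b$          |
| In set A: No  | $c$           | $d$          |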

where $a$, $b$, $c$ and $d$ are the numbers of items that fall into each category.

Let’s calculate $p_\mathrm{obs}$ (probability of observed agreement, both yes or both no) and $p_\mathrm{exp}$ (probability of expected agreement) as:

$$ \begin{align*} p_\mathrm{obs} & = \frac{a+d}{a+b+c+d} \\ p_\mathrm{Yes} & = \frac{a+b}{a+b+c+d} * \frac{a+c}{a+b+c+d} \\ p_\mathrm{No} & = \frac{c+d}{a+b+c+d} * \frac{b+d}{a+b+c+d} \\ p_\mathrm{exp} & = p_\mathrm{Yes} + p_\mathrm{No} \end{align*} $$

where $p_\mathrm{obs}$ is the probability that an item is in both sets or in neither set, $p_\mathrm{Yes}$ is the probability that an item is in both sets by chance (assuming that membership in set $A$ and membership in set $B$ are independent), $p_\mathrm{No}$ is the probability that an item is in neither set by chance, and $p_\mathrm{exp}$ is the probability that an item is either in both sets or in neither set by chance.

The kappa coefficient is calculated as:

$$ \mathrm{Sim}(a, b) = \mathrm{Kappa}(a, b) = \frac{p_\mathrm{obs} - p_\mathrm{exp}}{1 - p_\mathrm{exp}} $$

Note that the kappa coefficient can be negative.
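The calculation can be followed step by step in base R. The sketch below is illustrative only: `kappa_sim` is a hypothetical helper, not a simona function, the item names are invented, and both sets are assumed to be subsets of the universe.

```r
# Hypothetical helper: kappa coefficient of two item sets A and B within universe U.
# Assumes A and B are both subsets of U.
kappa_sim <- function(A, B, U) {
    n     <- length(U)
    n_ab  <- length(intersect(A, B))     # cell a: in both sets
    n_a   <- length(setdiff(A, B))       # cell b: only in A
    n_b   <- length(setdiff(B, A))       # cell c: only in B
    n_0   <- n - n_ab - n_a - n_b        # cell d: in neither set
    p_obs <- (n_ab + n_0) / n
    p_yes <- ((n_ab + n_a) / n) * ((n_ab + n_b) / n)
    p_no  <- ((n_b + n_0) / n) * ((n_a + n_0) / n)
    p_exp <- p_yes + p_no
    (p_obs - p_exp) / (1 - p_exp)
}

U <- paste0("g", 1:10)                   # made-up universe of 10 items

# Two sets sharing half of their items.
k_overlap <- kappa_sim(c("g1", "g2", "g3", "g4"), c("g3", "g4", "g5", "g6"), U)

# Two disjoint sets: agreement is lower than expected by chance.
k_disjoint <- kappa_sim(c("g1", "g2"), c("g3", "g4"), U)
```

The disjoint case gives a negative value because the observed agreement (0.6) is below the agreement expected by chance (0.68).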

The universe set can be set via the parameter anno_universe. By default it is the set of all items annotated to the whole DAG.

term_sim(dag, terms, method = "Sim_kappa",
    control = list(anno_universe = ...))

Sim_Jaccard, Sim_Dice and Sim_Overlap

The definitions of the Jaccard, Dice and overlap coefficients are similar. The Jaccard coefficient is:

$$ \mathrm{Jaccard}(a, b) = \frac{|A \cap B|}{|A \cup B|} $$

The Dice coefficient is:

$$ \mathrm{Dice}(a, b) = \frac{2*|A \cap B|}{|A| + |B|} $$

The overlap coefficient is:

$$ \mathrm{Overlap}(a, b) = \frac{|A \cap B|}{\min\{|A|, |B|\}} $$

The Dice and Jaccard coefficients are related by:

$$ \mathrm{Jaccard} = \frac{\mathrm{Dice}}{2 - \mathrm{Dice}} $$
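The three coefficients and the relation between Dice and Jaccard can be checked with a few lines of base R. This is a minimal sketch on invented item sets; the helper names `jaccard`, `dice` and `overlap` are hypothetical and not part of simona.

```r
# Hypothetical helpers operating on two sets of annotated items.
jaccard <- function(A, B) length(intersect(A, B)) / length(union(A, B))
dice    <- function(A, B) 2 * length(intersect(A, B)) / (length(A) + length(B))
overlap <- function(A, B) length(intersect(A, B)) / min(length(A), length(B))

# Made-up item sets: 2 shared items, 6 items in the union.
A <- c("g1", "g2", "g3", "g4")
B <- c("g3", "g4", "g5", "g6")

j <- jaccard(A, B)
d <- dice(A, B)
o <- overlap(A, B)

# Jaccard recovered from Dice via Jaccard = Dice / (2 - Dice).
j_from_dice <- d / (2 - d)
```

Because the relation between Dice and Jaccard is monotone, the two coefficients always rank term pairs in the same order, whereas the overlap coefficient can rank them differently when the set sizes are very unequal.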

The universe set can be set via the parameter anno_universe. By default it is the set of all items annotated to the whole DAG.

term_sim(dag, terms, method = "Sim_Jaccard",
    control = list(anno_universe = ...))
term_sim(dag, terms, method = "Sim_Dice",
    control = list(anno_universe = ...))
term_sim(dag, terms, method = "Sim_Overlap",
    control = list(anno_universe = ...))

Session Info

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] simona_1.3.12 knitr_1.48   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9            xml2_1.3.6            shape_1.4.6.1        
##  [4] digest_0.6.37         magrittr_2.0.3        evaluate_0.24.0      
##  [7] grid_4.4.1            RColorBrewer_1.1-3    iterators_1.0.14     
## [10] circlize_0.4.16       fastmap_1.2.0         foreach_1.5.2        
## [13] doParallel_1.0.17     jsonlite_1.8.8        GlobalOptions_0.1.2  
## [16] promises_1.3.0        ComplexHeatmap_2.20.0 codetools_0.2-20     
## [19] textshaping_0.4.0     jquerylib_0.1.4       cli_3.6.3            
## [22] shiny_1.9.1           rlang_1.1.4           crayon_1.5.3         
## [25] scatterplot3d_0.3-44  cachem_1.1.0          yaml_2.3.10          
## [28] tools_4.4.1           parallel_4.4.1        colorspace_2.1-1     
## [31] httpuv_1.6.15         GetoptLong_1.0.5      BiocGenerics_0.50.0  
## [34] mime_0.12             R6_2.5.1              png_0.1-8            
## [37] matrixStats_1.3.0     stats4_4.4.1          lifecycle_1.0.4      
## [40] S4Vectors_0.42.1      fs_1.6.4              htmlwidgets_1.6.4    
## [43] IRanges_2.38.1        clue_0.3-65           ragg_1.3.2           
## [46] cluster_2.1.6         pkgconfig_2.0.3       desc_1.4.3           
## [49] pkgdown_2.1.0         bslib_0.8.0           later_1.3.2          
## [52] Rcpp_1.0.13           systemfonts_1.1.0     xfun_0.47            
## [55] xtable_1.8-4          rjson_0.2.22          htmltools_0.5.8.1    
## [58] igraph_2.0.3          rmarkdown_2.28        Polychrome_1.5.1     
## [61] compiler_4.4.1