Semantic similarity
term_sim(dag, terms, method, control = list(), verbose = simona_opt$verbose)

dag: An ontology_DAG object.
terms: A vector of term names.
method: A term similarity method. All available methods are in all_term_sim_methods().
control: A list of parameters passed to the individual methods. See the subsections.
verbose: Whether to print messages.

Value: A numeric symmetric matrix.
The similarity between two terms a and b is calculated as the IC of their MICA term c normalized by the average of the IC of the two terms:
sim = IC(c)/((IC(a) + IC(b))/2)
    = 2*IC(c)/(IC(a) + IC(b))

Although any IC method can be used here, in most applications it is used together with the IC_annotation method.
Paper link: doi:10.5555/645527.657297 .
The IC method is fixed to IC_annotation.
The original Resnik similarity is the IC of the MICA term. There are three ways to normalize the Resnik similarity into the scale of [0, 1]:
Nunif
sim = IC(c)/log(N)

where N is the total number of items annotated to the whole DAG, i.e. the number of items annotated to the root. Then the IC
of a term with only one item annotated is -log(1/N) = log(N), which is the maximal IC value in the DAG.
Nmax
IC_max is the maximal IC of all terms. If there is a term with only one item annotated, Nmax is identical to the Nunif method.

sim = IC(c)/IC_max

Nunivers
The IC is normalized by the maximal IC of terms a and b.

sim = IC(c)/max(IC(a), IC(b))

Paper links: doi:10.1613/jair.514 , doi:10.1186/1471-2105-9-S5-S4 , doi:10.1186/1471-2105-11-562 , doi:10.1155/2013/292063 .
The normalization method can be set with the norm_method parameter:
term_sim(dag, terms, control = list(norm_method = "Nmax"))

Possible values for the norm_method parameter are "Nunif", "Nmax", "Nunivers" and "none".
It is calculated as:
sim = IC(c)/(IC(a) + IC(b) - IC(c))

The relation between the FaITH_2010 similarity and the Lin_1998 similarity is:

sim_FaITH = sim_Lin/(2 - sim_Lin)

Paper link: doi:10.1007/978-3-642-17746-0_39 .
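The relation above can be checked numerically. The following sketch uses the toy DAG from the Examples section at the bottom of this page and assumes the FaITH method is registered as "Sim_FaITH_2010" in all_term_sim_methods():

parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list("a" = 1:3, "b" = 3:4, "c" = 5, "d" = 7, "e" = 4:7, "f" = 8)
dag = create_ontology_DAG(parents, children, annotation = annotation)

sim_lin   = term_sim(dag, dag_all_terms(dag), method = "Sim_Lin_1998")
sim_faith = term_sim(dag, dag_all_terms(dag), method = "Sim_FaITH_2010")
# the two matrices are expected to agree, up to numerical precision and the
# handling of zero-IC terms such as the root
max(abs(sim_faith - sim_lin/(2 - sim_lin)))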
The IC method is fixed to IC_annotation.
If we think of Lin_1998 as a measure of how close terms a and b are to their MICA term c, the Relevance method corrects it by multiplying
a factor that accounts for the specificity of the information carried by c. The factor is calculated as 1 - p(c), where p(c) is the annotation-based
probability p(c) = k/N, with k the number of items annotated to c and N the total number of items annotated to the DAG. Then
the Relevance semantic similarity is calculated as:
sim = (1 - p(c)) * sim_Lin
    = (1 - p(c)) * 2*IC(c)/(IC(a) + IC(b))

Paper link: doi:10.1186/1471-2105-7-302 .
The IC method is fixed to IC_annotation.
The SimIC method is an improved correction of the Relevance method, because the latter performs poorly when p(c) is very small. The SimIC
correction factor for the MICA term c is:
1 - 1/(1 + IC(c))

Then the similarity is:

sim = (1 - 1/(1 + IC(c))) * sim_Lin
    = (1 - 1/(1 + IC(c))) * 2*IC(c)/(IC(a) + IC(b))

Paper link: doi:10.48550/arXiv.1001.0958 .
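The two correction factors can be compared directly as functions of p(c). The sketch below only illustrates the formulas above (assuming IC(c) = -log(p(c)), as under IC_annotation), not package internals:

p = seq(0.01, 1, by = 0.01)
f_relevance = 1 - p                 # correction factor of the Relevance method
f_simic     = 1 - 1/(1 - log(p))    # SimIC factor, 1 - 1/(1 + IC(c)) with IC(c) = -log(p)
# for very small p(c) the Relevance factor saturates near 1, while the SimIC
# factor still changes noticeably, which is the motivation for the correction
head(cbind(p, f_relevance, f_simic))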
The IC method is fixed to IC_annotation.
Being different from the "Relevance" and "SimIC_2010" methods that only use the IC of the MICA term, the XGraSM_2013 uses IC of all common ancestor terms of a and b.
First it calculates the mean IC of all common ancestor terms with positive IC values:
IC_mean = mean_t(IC(t)) where t is an ancestor of both a and b, and IC(t) > 0

Then, similar to the Lin_1998 method, it is normalized by the average IC of a and b:

sim = 2*IC_mean/(IC(a) + IC(b))

Paper link: doi:10.1186/1471-2105-14-284 .
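As a hand application of the formula, consider terms "d" and "f" in the toy DAG from the Examples section. The sketch below uses IC_annotation values computed by hand and assumes that, as for the MICA, the set of common ancestors includes the terms themselves:

# IC_annotation on the toy DAG (N = 8 items are annotated to the root "a")
ic = c(a = 0, b = -log(6/8), c = -log(4/8), d = -log(2/8), e = -log(4/8), f = -log(1/8))
common_anc = c("b", "d")        # common ancestors of "d" and "f" with IC > 0 ("a" has IC 0)
ic_mean = mean(ic[common_anc])
2*ic_mean/(ic["d"] + ic["f"])   # ~0.48 under the XGraSM_2013 definition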
The IC method is fixed to IC_annotation.
It also selects a subset of common ancestors of terms a and b. It only selects common ancestors which can reach a or b via one of its child terms
that does not belong to the common ancestors. In other words, from such a common ancestor there exists a path through which
the information is uniquely transmitted to a or to b, without passing through the other.
Then the mean IC of this subset of common ancestors is calculated and normalized as in the Lin_1998 method.
Paper link: doi:10.1016/j.gene.2014.12.062 .
It uses the aggregate information content from ancestors. First define the semantic weight (Sw) of a term t in the DAG:
Sw = 1/(1 + exp(-1/IC(t)))

Then the aggregation is calculated over the common ancestors only, and over the ancestors of the two terms a and b separately:

SV_{common ancestors} = sum_{t in common ancestors}(Sw(t))
SV_a = sum_{a' in a's ancestors}(Sw(a'))
SV_b = sum_{b' in b's ancestors}(Sw(b'))

The similarity is calculated as the ratio between the aggregation over the common ancestors and the average of the aggregations over a's ancestors and b's ancestors:

sim = 2*SV_{common ancestors}/(SV_a + SV_b)

Paper link: doi:10.1109/tcbb.2013.176 .
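A quick illustration of the semantic weight function defined above (a sketch of the formula only, not package code):

sw = function(ic) 1/(1 + exp(-1/ic))
sw(c(0.1, 0.5, 1, 2, 5))
# the weight is close to 1 for terms with very small IC and decreases
# towards 0.5 as the IC grows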
It uses the IC_Zhang_2006 IC method and the Lin_1998 method to calculate similarities:
sim = 2*IC_zhang(c)/(IC_zhang(a) + IC_zhang(b))

It uses the IC_universal IC method and the Nunivers method to calculate similarities:

sim = IC_universal(c)/max(IC_universal(a), IC_universal(b))

First, the S-value of an ancestor term c on a term a, S(c->a), is calculated (the definition of the S-value can be found in the help page of term_IC()).
Similar to Sim_AIC_2014, aggregations over the common ancestors, over a's ancestors and over b's ancestors are calculated separately:

SV_{common ancestors} = sum_{c in common ancestors}(S(c->a) + S(c->b))
SV_a = sum_{a' in a's ancestors}(S(a'->a))
SV_b = sum_{b' in b's ancestors}(S(b'->b))

Then the similarity is calculated as:

sim = 2*SV_{common ancestors}/(SV_a + SV_b)

Paper link: doi:10.1093/bioinformatics/btm087 .
The contribution of different semantic relations can be set with the contribution_factor parameter. The value should be a named numeric
vector whose names cover the relations defined in the relations argument of create_ontology_DAG(). For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value of contribution_factor can be set as:
term_sim(dag, terms, method = "Sim_Wang_2007",
control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))It is very similar as Sim_Wang_2007, but with a corrected contribution factor when calculating the S-value. From a parent term to a child term, Sim_Wang_2007 directly uses a weight for the relation between the parent and the child, e.g. 0.8 for "is_a" relation type and 0.6 for "part_of" relation type. In Sim_GOGO_2018, the weight is also scaled by the total number of children of that parent:
w = 1/(c + nc) + w_0

where w_0 is the original contribution factor, nc is the number of child terms of the parent, and c is chosen to ensure that the
maximal value of w is no larger than 1, i.e. c = max(w_0)/(1 - max(w_0)), assuming the minimal value of nc is 1. By default Sim_GOGO_2018
sets the contribution factor to 0.4 for "is_a" and 0.3 for "part_of", giving w = 1/(2/3 + nc) + w_0.
Paper link: doi:10.1038/s41598-018-33219-y .
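The rescaling can be illustrated with the default contribution factors (a sketch of the arithmetic above, not package code):

w0 = c(is_a = 0.4, part_of = 0.3)
cc = max(w0)/(1 - max(w0))             # 2/3, so that w <= 1 when nc = 1
w  = function(nc, type) 1/(cc + nc) + w0[[type]]
w(1, "is_a")                           # 1/(2/3 + 1) + 0.4 = 1
w(10, "is_a")                          # the weight shrinks as the parent has more children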
The contribution of different semantic relations can be set with the contribution_factor parameter. The value should be a named numeric
vector whose names cover the relations defined in the relations argument of create_ontology_DAG(). For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value of contribution_factor can be set as:
term_sim(dag, terms, method = "Sim_GOGO_2018",
control = list(contribution_factor = c("relation_a" = 0.4, "relation_b" = 0.3)))It is based on the distance between term a and b. It is defined as:
sim = 1/(1 + d(a, b))

The distance can be the shortest distance between a and b or the longest distance via the LCA term.
Paper link: doi:10.1109/21.24528 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Rada_1989",
control = list(distance = "shortest_distances_via_NCA"))It is also based on the distance between term a and b:
sim = 1 - d(a, b)/(2*max_depth)

where max_depth is the maximal depth (the maximal distance from the root) in the DAG. Similarly, d(a, b) can be the shortest
distance or the longest distance via the LCA.
Paper link: doi:10.1145/1097047.1097051 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Resnik_edge_2005",
control = list(distance = "shortest_distances_via_NCA"))It is similar as the Sim_Resnik_edge_2005 method, but it applies log-transformation on the distance and the depth:
sim = 1 - log(d(a, b) + 1)/log(2*max_depth + 1)

Paper link: doi:10.1186/1471-2105-13-261 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Leocock_1998",
control = list(distance = "shortest_distances_via_NCA"))It is based on the depth of the LCA term c and the longest distance between term a and b:
sim = 2*depth(c)/(len_c(a, b) + 2*depth(c))

where len_c(a, b) is the longest distance between a and b via the LCA c. The denominator in the equation can also be written as:

len_c(a, b) + 2*depth(c) = depth(c) + len(c, a) + depth(c) + len(c, b)
                         = depth_c(a) + depth_c(b)

where depth_c(a) is the longest distance from the root to a passing through c.
Paper link: doi:10.3115/981732.981751 .
It is a correction of the Sim_WP_1994 method. The correction factor for terms a and b with regard to their LCA c is:

CF(a, b) = (1-lambda)*(min(depth(a), depth(b)) - depth(c)) +
           lambda/(1 + abs(depth(a) - depth(b)))

lambda takes the value 1 if a and b are in an ancestor-offspring relation, and 0 otherwise.
Paper link: https://zenodo.org/record/1075130.
It is a correction of the Sim_WP_1994 method. The correction factor for term a and b is:
CF(a, b) = exp(-lambda*d(a, b)/max_depth)

lambda takes the value 1 if a and b are in an ancestor-offspring relation, and 0 otherwise. d(a, b) is the distance between a and b, which can be the shortest distance or the longest distance via the LCA.
Paper link: doi:10.48550/arXiv.1211.4709 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Leocock_1998",
control = list(distance = "shortest_distances_via_NCA"))It is very similar to the Sim_WP_1994 method:
sim = depth(c)/(len_c(a, b) + depth(c))
    = d(root, c)/(d(c, a) + d(c, b) + d(root, c))

where d(a, b) is the longest distance between a and b.
Paper link: https://aclanthology.org/C02-1090/.
It is purely based on the depth of term a, b and their LCA c.
sim = depth(c)/(depth(a) + depth(b) - depth(c))

The similarity value might be negative because there is no restriction that the path from the root to a or b must pass through c.
Paper link: doi:10.1145/500737.500762 .
It is calculated as:
sim = depth(c)^2/(depth_c(a)*depth_c(b))

where depth_c(a) is the longest distance from the root to a passing through c.
Paper link: doi:10.1186/1477-5956-10-s1-s18 .
For a term x, it first calculates a "mile-stone" value as
m(x) = 0.5/2^depth(x)

Then the distance between terms a and b via the LCA term c is:

D(c, a) + D(c, b) = m(c) - m(a) + m(c) - m(b)
                  = 2*m(c) - m(a) - m(b)
                  = 1/2^depth(c) - 0.5/2^depth(a) - 0.5/2^depth(b)

We change the original depth(a) to let it pass through the LCA term c when calculating the depth:
1/2^depth(c) - 0.5/2^depth(a) - 0.5/2^depth(b)
    = 1/2^depth(c) - 0.5/2^(depth(c) + len(c, a)) - 0.5/2^(depth(c) + len(c, b))
    = 1/2^depth(c) * (1 - 1/2^(len(c, a) + 1) - 1/2^(len(c, b) + 1))
    = 2^-depth(c) * (1 - 2^-(len(c, a) + 1) - 2^-(len(c, b) + 1))

And the final similarity is 1 - distance:

sim = 1 - 2^-depth(c) * (1 - 2^-(len(c, a) + 1) - 2^-(len(c, b) + 1))

Paper link: doi:10.1007/3-540-45483-7_8 .
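A small sketch of the final formula as a plain R function (only the arithmetic above, not the package implementation):

sim_zhong = function(depth_c, len_ca, len_cb) {
    1 - 2^(-depth_c) * (1 - 2^(-(len_ca + 1)) - 2^(-(len_cb + 1)))
}
sim_zhong(depth_c = 3, len_ca = 0, len_cb = 0)   # identical terms give similarity 1
sim_zhong(depth_c = 3, len_ca = 1, len_cb = 2)   # 1 - 2^-3 * (1 - 1/4 - 1/8) = 0.921875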
There is a parameter depth_via_LCA that can be set to TRUE or FALSE. If it is set to TRUE, depth(a) is redefined
so that the path must pass through the LCA term c. If it is FALSE, the original similarity definition in the paper is used;
note that the similarity might then be negative.
term_sim(dag, terms, method = "Sim_Zhong_2002",
control = list(depth_via_LCA = FALSE))It also takes accout of the distance between term a and b, and the depth of the LCA term c in the DAG.
The distance is calculated as:
D(a, b) = log(1 + d(a, b)*(max_depth - depth(c)))

Here d(a, b) can be the shortest distance between a and b or the longest distance via the LCA c.
Then the distance is transformed into the similarity value, scaled by the possible maximal and minimal values of D(a, b) in the DAG:

D_max = log(1 + 2*max_depth*max_depth)

The minimal value of D(a, b) is zero, reached when a is identical to b. Then the similarity value is scaled as:

sim = 1 - D(a, b)/D_max

Paper link: doi:10.1109/IEMBS.2006.259235 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_AlMubaid_2006",
control = list(distance = "shortest_distances_via_NCA"))It is similar to the Sim_AlMubaid_2006 method, but uses a non-linear form:
sim = exp(-0.2*d(a, b)) * atan(0.6*depth(c))

where d(a, b) can be the shortest distance or the longest distance via the LCA.
Paper link: doi:10.1109/TKDE.2003.1209005 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Li_2003",
control = list(distance = "shortest_distances_via_NCA"))The similarity is adjusted by the positions of term a, b and the LCA term c in the DAG. The similarity is defined as:
sim = max_depth/(max_depth + d(a, b)) * alpha/(alpha + beta)

where d(a, b) is the distance between a and b, which can be the shortest distance or the longest distance via the LCA.
In the tuning factor, alpha is the distance from the LCA to the root, which is depth(c), and beta is the distance to the leaves, which
is the minimal distance (i.e. the minimal height) of terms a and b:

alpha/(alpha + beta) = depth(c)/(depth(c) + min(height(a), height(b)))

Paper link: doi:10.1371/journal.pone.0066745 .
There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_RSS_2013",
control = list(distance = "shortest_distances_via_NCA"))It is similar as the Sim_RSS_2013 method, but it uses information content instead of the distance to adjust the similarity.
It first defines the semantic distance between term a and b as the sum of the distance to their MICA term c:
D(a, b) = D(c, a) + D(c, b)

And the distance from an ancestor c to a term a is:

D(c, a) = IC(a) - IC(c)                               # if c is an ancestor of a
D(a, b) = D(c, a) + D(c, b) = IC(a) + IC(b) - 2*IC(c) # if c is the MICA of a and b

Similar to Sim_RSS_2013, the similarity is also corrected by the positions of the MICA term and of a and b in the DAG:

sim = 1/(1 + D(a, b)) * alpha/(alpha + beta)

Now alpha is the IC of the MICA term:

alpha = IC(c)

And beta is the average of the maximal semantic distances of a and b to the leaves:

beta = 0.5*(IC(l_a) - IC(a) + IC(l_b) - IC(b))

where l_a is the leaf that a can reach with the highest IC (i.e. the most informative leaf), and similarly for l_b.
Paper link: doi:10.1371/journal.pone.0066745 .
It is based on the information content of terms on the path connecting term a and b via their MICA term c.
Denote the list of terms a, ..., c, ..., b composed of the shortest path from a to c and from c to b. The difference
between a and b is the sum of 1/IC of the terms on the path:

sum_{x in the path}(1/IC(x))

Then the distance is scaled into [0, 1] by an arctangent transformation:

atan(sum_{x in the path}(1/IC(x)))/(pi/2)

And finally the similarity is:

sim = 1 - atan(sum_{x in the path}(1/IC(x)))/(pi/2)

Paper link: doi:10.1109/BIBM.2010.5706623 .
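The arctangent scaling maps the unbounded sum into [0, 1), which can be seen directly (a sketch independent of the package):

d = c(0, 0.5, 1, 5, 50)     # possible values of the summed 1/IC distance
atan(d)/(pi/2)              # monotonically increasing, approaching 1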
It is similar to Sim_Shen_2010, which also sums values along the path passing through the LCA term. Instead of summing the information content, Sim_SSDD_2013 sums up a so-called "T-value":

sim = 1 - atan(sum_{x in the path}(T(x)))/(pi/2)

Each term has a T-value, which measures, on average, the semantic content a term inherits from its parents
and distributes to its offspring. The T-value of the root is 1. Assume a term t has two parents p1 and p2.
The T-value of term t is averaged from its parents:

(w1*T(p1) + w2*T(p2))/2

Since a parent may have other child terms, a factor w1 or w2 is multiplied to T(p1) and T(p2). Taking
p1 as an example, if p1 has n_p offspring terms (including itself) and t has n_t offspring terms (including itself),
then a fraction n_t/n_p of the information is transmitted from p1 downstream via t, thus w1 is defined as n_t/n_p.
Paper link: doi:10.1016/j.ygeno.2013.04.010 .
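A hand calculation of a few T-values on the toy DAG from the Examples section, following the description above (a sketch only; details such as how single-parent terms are handled are assumptions, not package code):

# offspring counts (including the term itself): a = 6, b = 5, c = 2, d = 2, e = 1, f = 1
T_a = 1                              # T-value of the root
T_b = (5/6) * T_a                    # single parent a, weight n_b/n_a = 5/6
T_c = ((2/6)*T_a + (2/5)*T_b)/2      # parents a and b, weights n_c/n_a and n_c/n_b
T_d = (2/5) * T_b                    # single parent b, weight n_d/n_b = 2/5
c(T_b = T_b, T_c = T_c, T_d = T_d)   # 0.833, 0.333, 0.333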
First, the semantic distance between terms a and b via the MICA term c is defined as:

D(a, b) = IC(a) + IC(b) - 2*IC(c)

Then there are several normalization methods to convert the distance into a similarity and scale it into the range [0, 1]:
max: 1 - D(a, b)/(2*IC_max)
Couto: min(1, D(a, b)/IC_max)
Lin: 1 - D(a, b)/(IC(a) + IC(b)), which is the same as the Sim_Lin_1998 method
Garla: 1 - log(D(a, b) + 1)/log(2*IC_max + 1)
log-Lin: 1 - log(D(a, b) + 1)/log(IC(a) + IC(b) + 1)
Rada: 1/(1 + D(a, b))
Paper link: https://aclanthology.org/O97-1002/.
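For instance, the "Lin" normalization reduces exactly to the Sim_Lin_1998 similarity, which can be checked with a line of arithmetic (arbitrary IC values for illustration):

ic_a = 1.2; ic_b = 0.9; ic_c = 0.5
1 - (ic_a + ic_b - 2*ic_c)/(ic_a + ic_b)   # "Lin" normalization of D(a, b)
2*ic_c/(ic_a + ic_b)                       # Sim_Lin_1998; both print 0.4761905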
There is a parameter norm_method which takes a value in "max", "Couto", "Lin", "Garla", "log-Lin", "Rada":
term_sim(dag, terms, method = "Sim_Jiang_1997",
control = list(norm_method = "Lin"))Denote two sets A and B as the items annotated to term a and b. The similarity value is the kappa coeffcient
of the two sets.
The universe or the background can be set via parameter anno_universe:
term_sim(dag, terms, method = "Sim_kappa",
control = list(anno_universe = ...))Denote two sets A and B as the items annotated to term a and b. The similarity value is the Jaccard coeffcient
of the two sets, defined as length(intersect(A, B))/length(union(A, B)).
The universe or the background can be set via parameter anno_universe:
term_sim(dag, terms, method = "Sim_Jaccard",
control = list(anno_universe = ...))Denote two sets A and B as the items annotated to term a and b. The similarity value is the Dice coeffcient
of the two sets, defined as 2*length(intersect(A, B))/(length(A) + length(B)).
The universe or the background can be set via parameter anno_universe:
term_sim(dag, terms, method = "Sim_Dice",
control = list(anno_universe = ...))Denote two sets A and B as the items annotated to term a and b. The similarity value is the overlap coeffcient
of the two sets, defined as length(intersect(A, B))/min(length(A), length(B)).
The universe or the background can be set via parameter anno_universe:
term_sim(dag, terms, method = "Sim_Overlap",
control = list(anno_universe = ...))parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = 1:3,
    "b" = 3:4,
    "c" = 5,
    "d" = 7,
    "e" = 4:7,
    "f" = 8
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_sim(dag, dag_all_terms(dag), method = "Sim_Lin_1998")
#> term_sim_method: Sim_Lin_1998
#> IC_method: IC_annotation
#> collecting all ancestors of input terms ...
#>
#> going through 0 / 6 ancestors ...
#>
#> going through 6 / 6 ancestors ... Done.
#> a b c d e f
#> a 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> b 0 1.0000000 0.5866099 0.3437110 0.5866099 0.2430647
#> c 0 0.5866099 1.0000000 0.2766917 1.0000000 0.2075187
#> d 0 0.3437110 0.2766917 1.0000000 0.2766917 0.8000000
#> e 0 0.5866099 1.0000000 0.2766917 1.0000000 0.2075187
#> f 0 0.2430647 0.2075187 0.8000000 0.2075187 1.0000000
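As a rough hand check of the output above, the d-f entry can be reproduced from the Lin_1998 formula, assuming IC_annotation counts the items annotated to a term or to any of its offspring (a sketch of the arithmetic, not package code):

# N = 8 items are annotated to the root "a"
# aggregated annotations: d -> {7, 8} (2 items), f -> {8} (1 item)
ic_d = -log(2/8)          # 1.386
ic_f = -log(1/8)          # 2.079
# the MICA of "d" and "f" is "d" itself, so:
2*ic_d/(ic_d + ic_f)      # 0.8, matching the "d"/"f" entry in the matrix above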