Semantic similarity
term_sim(dag, terms, method, control = list(), verbose = simona_opt$verbose)
dag: An ontology_DAG object.
terms: A vector of term names.
method: A term similarity method. All available methods are in all_term_sim_methods().
control: A list of parameters passing to individual methods. See the subsections.
verbose: Whether to print messages.
A numeric symmetric matrix.
The similarity between two terms a
and b
is calculated as the IC of their MICA term c
normalized by the average of the IC of the two terms:
sim = IC(c)/((IC(a) + IC(b))/2)
    = 2*IC(c)/(IC(a) + IC(b))
Although any IC method can be used here, in most applications it is used together with the IC_annotation method.
Paper link: doi:10.5555/645527.657297 .
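As a quick illustration (not using the package API), the formula can be sketched in base R. The IC values below are hypothetical; in practice they would be obtained e.g. via term_IC() and the MICA of the two terms.
IC_a <- 3.2; IC_b <- 2.5   # hypothetical ICs of terms a and b
IC_c <- 1.8                # hypothetical IC of their MICA term c
sim_lin <- 2*IC_c/(IC_a + IC_b)   # identical to IC_c/((IC_a + IC_b)/2)
sim_lin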
The IC method is fixed to IC_annotation
.
The original Resnik similarity is the IC of the MICA term. There are three ways to normalize the Resnik similarity into the scale of [0, 1]
:
Nunif
sim = IC(c)/log(N)
where N is the total number of items annotated to the whole DAG, i.e. the number of items annotated to the root. Then the IC of a term with only one item annotated is -log(1/N) = log(N), which is the maximal IC value in the DAG.
Nmax
IC_max is the maximal IC of all terms. If there is a term with only one item annotated, Nmax is identical to the Nunif method.
sim = IC(c)/IC_max
Nunivers
The IC is normalized by the maximal IC of terms a and b.
sim = IC(c)/max(IC(a), IC(b))
Paper link: doi:10.1613/jair.514 , doi:10.1186/1471-2105-9-S5-S4 , doi:10.1186/1471-2105-11-562 , doi:10.1155/2013/292063 .
The normalization method can be set with the norm_method
parameter:
term_sim(dag, terms, control = list(norm_method = "Nmax"))
Possible values for the norm_method
parameter are "Nunif", "Nmax", "Nunivers" and "none".
The FaITH_2010 similarity is calculated as:
sim = IC(c)/(IC(a) + IC(b) - IC(c))
The relation between FaITH_2010 similarity and Lin_1998 similarity is:
sim_FaITH = sim_Lin/(2 - sim_Lin)
Paper link: doi:10.1007/978-3-642-17746-0_39 .
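The relation to the Lin similarity can be checked numerically in base R with hypothetical IC values:
IC_a <- 3.2; IC_b <- 2.5; IC_c <- 1.8              # hypothetical ICs of a, b and their MICA c
sim_faith <- IC_c/(IC_a + IC_b - IC_c)
sim_lin   <- 2*IC_c/(IC_a + IC_b)
all.equal(sim_faith, sim_lin/(2 - sim_lin))        # TRUE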
The IC method is fixed to IC_annotation
.
If Lin_1998 is thought of as a measure of how close terms a and b are to their MICA term c, the Relevance method corrects it by multiplying a factor that takes into account the specificity of the information c provides. The factor is calculated as 1 - p(c), where p(c) is the annotation-based probability p(c) = k/N, with k the number of items annotated to c and N the total number of items annotated to the DAG. Then the Relevance semantic similarity is calculated as:
sim = (1 - p(c)) * sim_Lin
    = (1 - p(c)) * 2*IC(c)/(IC(a) + IC(b))
Paper link: doi:10.1186/1471-2105-7-302 .
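A minimal base R sketch of the Relevance correction, assuming hypothetical annotation counts (k items annotated to the MICA term c, N items annotated to the root):
k <- 40; N <- 1000                     # hypothetical annotation counts
p_c  <- k/N
IC_c <- -log(p_c)                      # IC_annotation of the MICA term
IC_a <- 4.1; IC_b <- 3.6               # hypothetical ICs of a and b
sim_relevance <- (1 - p_c) * 2*IC_c/(IC_a + IC_b)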
The IC method is fixed to IC_annotation
.
The SimIC_2010 method is an improved correction of the Relevance method, because the latter performs badly when p(c) is very small. The SimIC correction factor for the MICA term c is:
1 - 1/(1 + IC(c))
Then the similarity is:
sim = (1 - 1/(1 + IC(c))) * sim_Lin
    = (1 - 1/(1 + IC(c))) * 2*IC(c)/(IC(a) + IC(b))
Paper link: doi:10.48550/arXiv.1001.0958 .
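Only the correction factor changes compared to the Relevance method, as this sketch with hypothetical IC values shows:
IC_a <- 4.1; IC_b <- 3.6; IC_c <- 3.2           # hypothetical ICs of a, b and their MICA c
sim_simic <- (1 - 1/(1 + IC_c)) * 2*IC_c/(IC_a + IC_b)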
The IC method is fixed to IC_annotation
.
Being different from the "Relevance" and "SimIC_2010" methods that only use the IC of the MICA term, the XGraSM_2013 method uses the IC of all common ancestor terms of a and b.
First it calculates the mean IC of all common ancestor terms with positive IC values:
IC_mean = mean_t(IC(t)) where t is an ancestor of both a and b, and IC(t) > 0
then, similar to the Lin_1998 method, it is normalized by the average IC of a and b:
sim = 2*IC_mean/(IC(a) + IC(b))
Paper link: doi:10.1186/1471-2105-14-284 .
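A sketch with a hypothetical vector of ICs of the common ancestors (the root has IC 0 and is therefore excluded):
IC_anc <- c(0, 0.7, 1.3, 2.1)          # hypothetical ICs of all common ancestors of a and b
IC_a <- 3.0; IC_b <- 2.6               # hypothetical ICs of a and b
IC_mean <- mean(IC_anc[IC_anc > 0])
sim_xgrasm <- 2*IC_mean/(IC_a + IC_b)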
The IC method is fixed to IC_annotation
.
It also selects a subset of common ancestors of terms a
and b
. It only selects common ancestors which can reach a
or b
via one of its child terms
that does not belong to the common ancestors. In other words, from the common ancestor there exists a path through which the information is uniquely transmitted to a or b, without passing through the other.
Then the mean IC of the subset of common ancestors is calculated and normalized by the Lin_1998 method.
Paper link: doi:10.1016/j.gene.2014.12.062 .
It uses the aggregate information content from ancestors. First define the semantic weight (Sw
) of a term t
in the DAG:
Sw(t) = 1/(1 + exp(-1/IC(t)))
Then calculate the aggregation over the common ancestors only, and the aggregations over the ancestors of the two terms a and b separately:
SV_{common ancestors} = sum_{t in common ancestors}(Sw(t))
SV_a = sum_{a' in a's ancestors}(Sw(a'))
SV_b = sum_{b' in b's ancestors}(Sw(b'))
The similarity is calculated as the ratio between the aggregation over the common ancestors and the average of the aggregations over a's ancestors and b's ancestors:
sim = 2*SV_{common ancestors}/(SV_a + SV_b)
Paper link: doi:10.1109/tcbb.2013.176 .
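A sketch of the aggregation, using hypothetical IC values for the ancestor sets:
Sw <- function(ic) 1/(1 + exp(-1/ic))   # semantic weight
IC_common <- c(1.2, 2.0)                # hypothetical ICs of the common ancestors
IC_anc_a  <- c(1.2, 2.0, 2.8, 3.5)      # hypothetical ICs of a's ancestors
IC_anc_b  <- c(1.2, 2.0, 2.4)           # hypothetical ICs of b's ancestors
sim_aic <- 2*sum(Sw(IC_common))/(sum(Sw(IC_anc_a)) + sum(Sw(IC_anc_b)))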
It uses the IC_Zhang_2006 IC method and the Lin_1998 method to calculate similarities:
sim = 2*IC_zhang(c)/(IC_zhang(a) + IC_zhang(b))
It uses the IC_universal IC method and the Nunivers method to calculate similarities:
sim = IC_universal(c)/max(IC_universal(a), IC_universal(b))
First, S-value of an ancestor term c
on a term a
(S(c->a)
) is calculated (the definition of the S-value can be found in the help page of term_IC()
).
Similar to the Sim_AIC_2014 method, aggregations over the common ancestors, over a's ancestors and over b's ancestors are calculated:
SV_{common ancestors} = sum_{c in common ancestors}(S(c->a) + S(c->b))
SV_a = sum_{a' in a's ancestors}(S(a'->a))
SV_b = sum_{b' in b's ancestors}(S(b'->b))
Then the similarity is calculated as:
sim = 2*SV_{common ancestors}/(SV_a + SV_b)
Paper link: doi:10.1093/bioinformatics/btm087 .
The contribution of different semantic relations can be set with the contribution_factor
parameter. The value should be a named numeric
vector where names should cover the relations defined in relations
set in create_ontology_DAG()
. For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value for contribution_factor
can be set as:
term_sim(dag, terms, method = "Sim_Wang_2007",
control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))
It is very similar to Sim_Wang_2007, but with a corrected contribution factor when calculating the S-value. From a parent term to a child term, Sim_Wang_2007 directly uses a weight for the relation between the parent and the child, e.g. 0.8 for the "is_a" relation type and 0.6 for the "part_of" relation type. In Sim_GOGO_2018, the weight is additionally scaled by the total number of children of that parent:
w = 1/(c + nc) + w_0
where w_0 is the original contribution factor, nc is the number of child terms of the parent, and c is chosen to ensure that the maximal value of w is no larger than 1, i.e. c = max(w_0)/(1 - max(w_0)), assuming the minimal value of nc is 1. By default Sim_GOGO_2018 sets the contribution factor to 0.4 for "is_a" and 0.3 for "part_of", which gives w = 1/(2/3 + nc) + w_0.
Paper link: doi:10.1038/s41598-018-33219-y .
The contribution of different semantic relations can be set with the contribution_factor
parameter. The value should be a named numeric
vector where names should cover the relations defined in relations
set in create_ontology_DAG()
. For example, if there are two relations
"relation_a" and "relation_b" set in the DAG, the value for contribution_factor
can be set as:
term_sim(dag, terms, method = "Sim_GOGO_2018",
control = list(contribution_factor = c("relation_a" = 0.4, "relation_b" = 0.3)))
It is based on the distance between term a
and b
. It is defined as:
sim = 1/(1 + d(a, b))
The distance can be the shortest distance between a
and b
or the longest distance via the LCA term.
Paper link: doi:10.1109/21.24528 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Rada_1989",
control = list(distance = "shortest_distances_via_NCA"))
It is also based on the distance between term a
and b
:
sim = 1 - d(a, b)/2/max_depth
where max_depth
is the maximal depth (maximal distance from root) in the DAG. Similarly, d(a, b)
can be the shortest
distance or the longest distance via LCA.
Paper link: doi:10.1145/1097047.1097051 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Resnik_edge_2005",
control = list(distance = "shortest_distances_via_NCA"))
It is similar to the Sim_Resnik_edge_2005 method, but it applies a log-transformation on the distance and the depth:
sim = 1 - log(d(a, b) + 1)/log(2*max_depth + 1)
Paper link: doi:10.1186/1471-2105-13-261 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Leocock_1998",
control = list(distance = "shortest_distances_via_NCA"))
It is based on the depth of the LCA term c
and the longest distance between term a
and b
:
sim = 2*depth(c)/(len_c(a, b) + 2*depth(c))
where len_c(a, b)
is the longest distance between a
and b
via LCA c
. The denominator in the equation can also be written as:
len_c(a, b) + 2*depth(c) = depth(c) + len(c, a) + depth(c) + len(c, b)
                         = depth_c(a) + depth_c(b)
where depth_c(a)
is the longest distance from root to a
passing through c
.
Paper link: doi:10.3115/981732.981751 .
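A sketch of the formula with hypothetical depths and path lengths:
depth_c <- 3                 # hypothetical depth of the LCA c
len_ca <- 2; len_cb <- 4     # hypothetical longest distances from c to a and to b
sim_wp <- 2*depth_c/(len_ca + len_cb + 2*depth_c)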
It is a correction of the Sim_WP_1994 method. The correction factor for terms a and b regarding their LCA c is:
CF(a, b) = (1-lambda)*(min(depth(a), depth(b)) - depth(c)) +
           lambda/(1 + abs(depth(a) - depth(b)))
lambda takes the value 1 if a and b are in an ancestor-offspring relation, and 0 otherwise.
Paper link: https://zenodo.org/record/1075130.
It is a correction of the Sim_WP_1994 method. The correction factor for terms a and b is:
CF(a, b) = exp(-lambda*d(a, b)/max_depth)
lambda takes the value 1 if a and b are in an ancestor-offspring relation, and 0 otherwise. d(a, b) is the distance between a and b, which can be the shortest distance or the longest distance via the LCA (see the distance parameter below).
Paper link: doi:10.48550/arXiv.1211.4709 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Leocock_1998",
control = list(distance = "shortest_distances_via_NCA"))
It is very similar to the Sim_WP_1994 method:
sim = depth(c)/(len_c(a, b) + depth(c))
    = d(root, c)/(d(c, a) + d(c, b) + d(root, c))
where d(a, b)
is the longest distance between a
and b
.
Paper link: https://aclanthology.org/C02-1090/.
It is purely based on the depth of term a
, b
and their LCA c
.
sim = depth(c)/(depth(a) + depth(b) - depth(c))
The similarity value might be negative because there is no restriction that the path from root to a or b must pass through c.
Paper link: doi:10.1145/500737.500762 .
It is calculated as:
sim = depth(c)^2/depth_c(a)/depth_c(b)
where depth_c(a) is the longest distance from root to a passing through c.
Paper link: doi:10.1186/1477-5956-10-s1-s18 .
For a term x
, it first calculates a "mile-stone" value as
m(x) = 0.5/2^depth(x)
The distance between terms a and b via the LCA term c is:
D(c, a) + D(c, b) = m(c) - m(a) + m(c) - m(b)
                  = 2*m(c) - m(a) - m(b)
                  = 1/2^depth(c) - 0.5/2^depth(a) - 0.5/2^depth(b)
We change the original depth(a)
to let it go through LCA term c
when calculating the depth:
1/2^depth(c) - 0.5/2^depth(a) - 0.5/2^depth(b)
= 1/2^depth(c)- 0.5/2^(depth(c) + len(c, a)) - 0.5/2^(depth(c) + len(c, b))
= 1/2^depth(c) * (1 - 1/2^(len(c, a) + 1) - 1/2^(len(c, b) + 1))
= 2^-depth(c) * (1 - 2^-(len(c, a) + 1) - 2^-(len(c, b) + 1))
And the final similarity is 1 - distance:
sim = 1 - 2^-depth(c) * (1 - 2^-(len(c, a) + 1) - 2^-(len(c, b) + 1))
Paper link: doi:10.1007/3-540-45483-7_8 .
There is a parameter depth_via_LCA that can be set to TRUE or FALSE. If it is set to TRUE, depth(a) is re-defined so that the path from root to a passes through the LCA term c. If it is FALSE, the original similarity definition in the paper is used; note that the similarity might then be negative.
term_sim(dag, terms, method = "Sim_Zhong_2002",
control = list(depth_via_LCA = FALSE))
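A sketch of the mile-stone based formula with hypothetical depths, assuming the depths of a and b are taken through the LCA c (i.e. depth_via_LCA = TRUE):
m <- function(depth) 0.5/2^depth                  # mile-stone value
depth_c <- 2                                      # hypothetical depth of the LCA c
len_ca <- 1; len_cb <- 3                          # hypothetical distances from c to a and to b
d <- 2*m(depth_c) - m(depth_c + len_ca) - m(depth_c + len_cb)
sim_zhong <- 1 - d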
It also takes account of the distance between terms a and b, and the depth of the LCA term c in the DAG.
The distance is calculated as:
D(a, b) = log(1 + d(a, b)*(max_depth - depth(c)))
Here d(a, b) can be the shortest distance between a and b or the longest distance via the LCA c.
Then the distance is transformed into the similarity value scaled by the possible maximal and minimal values of D(a, b)
from the DAG:
D_max = log(1 + 2*max_depth * max_depth)
And the minimal value of D(a, b)
is zero when a
is identical to b
. Then the similarity value is scaled as:
sim = 1 - D(a, b)/D_max
Paper link: doi:10.1109/IEMBS.2006.259235 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_AlMubaid_2006",
control = list(distance = "shortest_distances_via_NCA"))
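A sketch with hypothetical values:
max_depth <- 10                   # hypothetical maximal depth of the DAG
depth_c <- 4; d_ab <- 3           # hypothetical depth of the LCA and distance between a and b
D     <- log(1 + d_ab*(max_depth - depth_c))
D_max <- log(1 + 2*max_depth*max_depth)
sim_almubaid <- 1 - D/D_max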
It is similar to the Sim_AlMubaid_2006 method, but uses a non-linear form:
sim = exp(0.2*d(a, b)) * atan(0.6*depth(c))
where d(a, b)
can be the shortest distance or the longest distance via LCA.
Paper link: doi:10.1109/TKDE.2003.1209005 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_Li_2003",
control = list(distance = "shortest_distances_via_NCA"))
The similarity is adjusted by the positions of term a
, b
and the LCA term c
in the DAG. The similarity is defined as:
sim = max_depth/(max_depth + d(a, b)) * alpha/(alpha + beta)
where d(a, b)
is the distance between a
and b
which can be the shortest distance or the longest distance via LCA.
In the tuning factor, alpha is the distance from the LCA to the root, which is depth(c). beta is the distance to the leaves, which is the minimal distance (or the minimal height) of terms a and b:
alpha/(alpha + beta) = depth(c)/(depth(c) + min(height(a), height(b)))
Paper link: doi:10.1371/journal.pone.0066745 .
There is a parameter distance
which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
term_sim(dag, terms, method = "Sim_RSS_2013",
control = list(distance = "shortest_distances_via_NCA"))
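A sketch with hypothetical depths, heights and distance:
max_depth <- 10                  # hypothetical maximal depth of the DAG
d_ab <- 4; depth_c <- 3          # hypothetical distance between a and b, and depth of the LCA
height_a <- 2; height_b <- 5     # hypothetical heights (distances to leaves) of a and b
sim_rss <- max_depth/(max_depth + d_ab) * depth_c/(depth_c + min(height_a, height_b))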
It is similar to the Sim_RSS_2013 method, but it uses information content instead of the distance to adjust the similarity.
It first defines the semantic distance between term a
and b
as the sum of the distance to their MICA term c
:
D(a, b) = D(c, a) + D(c, b)
And the distance from an ancestor to a term is:
D(c, a) = IC(a) - IC(c) # if c is an ancestor of a
D(a, b) = D(c, a) + D(c, b) = IC(a) + IC(b) - 2*IC(c) # if c is the MICA of a and b
Similarly, the similarity is also corrected by the position of MICA term and a
and b
in the DAG:
1/(1 + D(a, b)) * alpha/(alpha + beta)
Now alpha
is the IC of the MICA term:
alpha = IC(c)
And beta
is the average of the maximal semantic distance of a
and b
to leaves.
beta = 0.5*(IC(l_a) - IC(a) + IC(l_b) - IC(b))
where l_a
is the leaf that a
can reach with the highest IC (i.e. most informative leaf), and so is l_b
.
Paper link: doi:10.1371/journal.pone.0066745 .
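A sketch with hypothetical IC values, where l_a and l_b stand for the most informative leaves of a and b:
IC_a <- 4.0; IC_b <- 3.5; IC_c <- 2.0     # hypothetical ICs of a, b and their MICA c
IC_la <- 6.0; IC_lb <- 5.5                # hypothetical ICs of the most informative leaves
D_ab  <- IC_a + IC_b - 2*IC_c
alpha <- IC_c
beta  <- 0.5*((IC_la - IC_a) + (IC_lb - IC_b))
sim_hrss <- 1/(1 + D_ab) * alpha/(alpha + beta)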
It is based on the information content of terms on the path connecting term a
and b
via their MICA term c
.
Denote the list of terms a, ..., c, ..., b composed of the shortest path from a to c and from b to c. The difference between a and b is the sum of 1/IC of the terms on the path:
sum_{x in the path}(1/IC(x))
Then the distance is scaled into [0, 1] by an arctangent transformation:
atan(sum_{x in the path}(1/IC(x)))/(pi/2)
And finally the similarity is:
sim = 1 - atan(sum_{x in the path}(1/IC(x)))/(pi/2)
Paper link: doi:10.1109/BIBM.2010.5706623 .
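A sketch with hypothetical ICs of the terms on the path:
IC_path <- c(3.8, 2.9, 2.0, 2.6, 3.4)     # hypothetical ICs along the path a, ..., c, ..., b
sim_shen <- 1 - atan(sum(1/IC_path))/(pi/2)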
It is similar to the Sim_Shen_2010 method, which also sums values along the path passing through the LCA term. Instead of summing the information content, Sim_SSDD_2013 sums up a so-called "T-value":
sim = 1 - atan(sum_{x in the path}(T(x)))/(pi/2)
Each term has a T-value, which measures the semantic content a term on average inherits from its parents and distributes to its offspring. The T-value of the root is 1. Assume a term t has two parents p1 and p2. The T-value of term t is averaged from its parents:
(w1*T(p1) + w2*T(p2))/2
Since a parent may have other child terms, the factors w1 and w2 are multiplied with T(p1) and T(p2). Taking p1 as an example, it has n_p offspring terms (including itself) and t has n_t offspring terms (including itself). This means a fraction n_t/n_p of the information is transmitted from p1 downstream via t, thus w1 is defined as n_t/n_p.
Paper link: doi:10.1016/j.ygeno.2013.04.010 .
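A sketch of the T-value propagation for a single term t with two parents (all counts and parent T-values are hypothetical):
T_p1 <- 0.5; T_p2 <- 0.4      # hypothetical T-values of the two parents
n_t  <- 10                    # hypothetical number of offspring of t (including itself)
n_p1 <- 25; n_p2 <- 20        # hypothetical numbers of offspring of p1 and p2 (including themselves)
w1 <- n_t/n_p1; w2 <- n_t/n_p2
T_t <- (w1*T_p1 + w2*T_p2)/2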
First semantic distance between term a
and b
via MICA term c
is defined as:
D(a, b) = IC(a) + IC(b) - 2*IC(c)
Then there are several normalization method to change the distance to similarity and to scale it into the range of [0, 1]
.
max: 1 - D(a, b)/2/IC_max
Couto: min(1, D(a, b)/IC_max)
Lin: 1 - D(a, b)/(IC(a) + IC(b)), which is the same as the Sim_Lin_1998 method
Garla: 1 - log(D(a, b) + 1)/log(2*IC_max + 1)
log-Lin: 1 - log(D(a, b) + 1)/log(IC(a) + IC(b) + 1)
Rada: 1/(1 + D(a, b))
Paper link: https://aclanthology.org/O97-1002/.
There is a parameter norm_method which takes a value in "max", "Couto", "Lin", "Garla", "log-Lin", "Rada":
term_sim(dag, terms, method = "Sim_Jiang_1997",
control = list(norm_method = "Lin"))
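Two of the normalizations, sketched with hypothetical IC values:
IC_a <- 4.0; IC_b <- 3.5; IC_c <- 2.0; IC_max <- 8   # hypothetical ICs and maximal IC
D <- IC_a + IC_b - 2*IC_c
sim_lin_norm  <- 1 - D/(IC_a + IC_b)    # "Lin" normalization, identical to Sim_Lin_1998
sim_rada_norm <- 1/(1 + D)              # "Rada" normalization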
Denote two sets A
and B
as the items annotated to term a
and b
. The similarity value is the kappa coefficient
of the two sets.
The universe or the background can be set via parameter anno_universe
:
term_sim(dag, terms, method = "Sim_kappa",
control = list(anno_universe = ...))
Denote two sets A
and B
as the items annotated to term a
and b
. The similarity value is the Jaccard coefficient
of the two sets, defined as length(intersect(A, B))/length(union(A, B))
.
The universe or the background can be set via parameter anno_universe
:
term_sim(dag, terms, method = "Sim_Jaccard",
control = list(anno_universe = ...))
Denote two sets A
and B
as the items annotated to term a
and b
. The similarity value is the Dice coefficient
of the two sets, defined as 2*length(intersect(A, B))/(length(A) + length(B))
.
The universe or the background can be set via parameter anno_universe
:
term_sim(dag, terms, method = "Sim_Dice",
control = list(anno_universe = ...))
Denote two sets A
and B
as the items annotated to term a
and b
. The similarity value is the overlap coefficient
of the two sets, defined as length(intersect(A, B))/min(length(A), length(B))
.
The universe or the background can be set via parameter anno_universe
:
term_sim(dag, terms, method = "Sim_Overlap",
control = list(anno_universe = ...))
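The Jaccard, Dice and overlap coefficients can be sketched directly on two hypothetical annotation sets (the kappa coefficient additionally depends on the universe of items):
A <- c("item1", "item2", "item3")           # hypothetical items annotated to term a
B <- c("item2", "item3", "item4", "item5")  # hypothetical items annotated to term b
jaccard <- length(intersect(A, B))/length(union(A, B))
dice    <- 2*length(intersect(A, B))/(length(A) + length(B))
overlap <- length(intersect(A, B))/min(length(A), length(B))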
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
"a" = 1:3,
"b" = 3:4,
"c" = 5,
"d" = 7,
"e" = 4:7,
"f" = 8
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_sim(dag, dag_all_terms(dag), method = "Sim_Lin_1998")
#> term_sim_method: Sim_Lin_1998
#> IC_method: IC_annotation
#> collecting all ancestors of input terms ...
#>
#> going through 0 / 6 ancestors ...
#>
#> going through 6 / 6 ancestors ... Done.
#> a b c d e f
#> a 1 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> b 0 1.0000000 0.5866099 0.3437110 0.5866099 0.2430647
#> c 0 0.5866099 1.0000000 0.2766917 1.0000000 0.2075187
#> d 0 0.3437110 0.2766917 1.0000000 0.2766917 0.8000000
#> e 0 0.5866099 1.0000000 0.2766917 1.0000000 0.2075187
#> f 0 0.2430647 0.2075187 0.8000000 0.2075187 1.0000000