Semantic similarity between two groups of terms
group_sim(
dag,
group1,
group2,
method,
control = list(),
verbose = simona_opt$verbose
)
dag
An ontology_DAG object.
group1
A vector of term names or a list of term vectors.
group2
A vector of term names or a list of term vectors.
method
A group similarity method. All available methods are in all_group_sim_methods().
control
A list of parameters passed to individual methods. The term similarity method is controlled by term_sim_method and the IC method is controlled by IC_method. Other parameters related to term similarity can also be specified in control. See the subsections.
verbose
Whether to print messages.
A numeric scalar, a numeric vector or a matrix, depending on the data type of group1 and group2.
If annotation is set in create_ontology_DAG() and you want to directly calculate the semantic similarity between two annotated items, you can first get the terms associated with the two items using annotated_terms():
group1 = annotated_terms(dag, item1)[[1]]
group2 = annotated_terms(dag, item2)[[1]]
group_sim(dag, group1, group2, ...)
Denote S(a, b) as the semantic similarity between terms a and b, where a is from group1 and b is from group2. The similarity between group1 and group2 is the average similarity of every pair of individual terms in the two groups:
group_sim = mean_{a in group1, b in group2}(S(a, b))
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1093/bioinformatics/btg153.
This is the maximal S(a, b) among all pairs of terms in group1 and group2:
group_sim = max_{a in group1, b in group2}(S(a, b))
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_max",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1109/TCBB.2005.50.
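The pairwise average and pairwise maximum can be checked by hand on a small matrix. The following is a minimal base R sketch, not the package implementation; the similarity values in S are made up for illustration and would in practice come from the chosen term_sim_method.

```r
# Toy term-by-term similarity matrix S(a, b); the values are made up.
# Rows are terms in group1, columns are terms in group2.
S = matrix(c(0.2, 0.8, 0.4,
             0.6, 0.1, 0.5),
           nrow = 2, byrow = TRUE,
           dimnames = list(c("a1", "a2"), c("b1", "b2", "b3")))

sim_avg = mean(S)  # average over all pairs of terms
sim_max = max(S)   # similarity of the single best pair
print(c(avg = sim_avg, max = sim_max))
```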
BMA stands for "best-match average". First define the similarity of a term to a group of terms as
S(x, group) = max_{y in group}(S(x, y))
which is the similarity between x and its most similar term in group.
Then the BMA similarity is calculated as:
group_sim = 0.5*(mean_{a in group1}(S(a, group2)) + mean_{b in group2}(S(b, group1)))
So it is the average of the similarity of every term in group1 to the whole group2 and of every term in group2 to the whole group1.
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1155/2012/975783.
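As an illustration of the best-match average, here is a small base R sketch under the assumption that the term-by-term similarities are already available as a matrix. The values in S are made up; the code only mirrors the formula above and is not the package's own code.

```r
# Toy similarity matrix (made-up values): rows = group1, columns = group2.
S = matrix(c(0.2, 0.8, 0.4,
             0.6, 0.1, 0.5),
           nrow = 2, byrow = TRUE)

# S(x, group): the best match of each term against the other group.
best_1to2 = apply(S, 1, max)  # one value per term in group1
best_2to1 = apply(S, 2, max)  # one value per term in group2

sim_bma = 0.5 * (mean(best_1to2) + mean(best_2to1))
print(sim_bma)
```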
BMM stands for "best-match max". It is defined as:
group_sim = max(mean_{a in group1}(S(a, group2)), mean_{b in group2}(S(b, group1)))
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1186/1471-2105-7-302.
ABM stands for "average best-match". It is defined as:
group_sim = (sum_{a in group1}(S(a, group2)) + sum_{b in group2}(S(b, group1)))/(n1 + n2)
where n1 and n2 are the numbers of terms in group1 and group2.
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1186/1471-2105-14-284.
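The difference between BMM and ABM is easiest to see when the two groups differ in size. The sketch below uses a made-up similarity matrix and plain base R; it follows the two formulas above and is only an illustration.

```r
# Toy similarity matrix (made-up values): 2 terms in group1, 3 in group2.
S = matrix(c(0.2, 0.8, 0.4,
             0.6, 0.1, 0.5),
           nrow = 2, byrow = TRUE)

best_1to2 = apply(S, 1, max)  # S(a, group2) for every a in group1
best_2to1 = apply(S, 2, max)  # S(b, group1) for every b in group2
n1 = nrow(S)
n2 = ncol(S)

# BMM: the larger of the two directional means.
sim_bmm = max(mean(best_1to2), mean(best_2to1))
# ABM: all best matches pooled, so the larger group carries more weight.
sim_abm = (sum(best_1to2) + sum(best_2to1)) / (n1 + n2)
print(c(BMM = sim_bmm, ABM = sim_abm))
```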
First define the distance of a term to a group of terms:
D(x, group) = 1 - S(x, group)
Then the Hausdorff distance between the two groups is:
HDF = max(max_{a in group1}(D(a, group2)), max_{b in group2}(D(b, group1)))
The final similarity is:
group_sim = 1 - HDF
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
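The Hausdorff-based similarity is driven by the single worst best-match. A minimal base R sketch with a made-up similarity matrix, mirroring the definitions above rather than the package internals:

```r
# Toy similarity matrix (made-up values): rows = group1, columns = group2.
S = matrix(c(0.2, 0.8, 0.4,
             0.6, 0.1, 0.5),
           nrow = 2, byrow = TRUE)

# D(x, group) = 1 - S(x, group), with S(x, group) the best match of x.
D_1to2 = 1 - apply(S, 1, max)
D_2to1 = 1 - apply(S, 2, max)

hdf = max(max(D_1to2), max(D_2to1))  # Hausdorff distance
sim_hdf = 1 - hdf
print(sim_hdf)
```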
Instead of using the maximal distance from one group to the other, MHDF uses the mean distance:
MHDF = max(mean_{a in group1}(D(a, group2)), mean_{b in group2}(D(b, group1)))
The final similarity is:
group_sim = 1 - MHDF
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1109/ICPR.1994.576361.
It is defined as:
VHDF = 0.5*(sqrt(mean_{a in group1}(D(a, group2)^2)) + sqrt(mean_{b in group2}(D(b, group1)^2)))
group_sim = 1 - VHDF
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1073/pnas.0702965104.
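MHDF and VHDF both soften the Hausdorff distance by aggregating over all terms instead of taking the worst one. The following base R sketch computes both on a made-up similarity matrix; it is an illustration of the formulas above, not the package code.

```r
# Toy similarity matrix (made-up values): rows = group1, columns = group2.
S = matrix(c(0.2, 0.8, 0.4,
             0.6, 0.1, 0.5),
           nrow = 2, byrow = TRUE)

D_1to2 = 1 - apply(S, 1, max)  # D(a, group2) for every a in group1
D_2to1 = 1 - apply(S, 2, max)  # D(b, group1) for every b in group2

# MHDF: the larger of the two mean distances.
mhdf = max(mean(D_1to2), mean(D_2to1))
sim_mhdf = 1 - mhdf

# VHDF: average of the two root-mean-square distances.
vhdf = 0.5 * (sqrt(mean(D_1to2^2)) + sqrt(mean(D_2to1^2)))
sim_vhdf = 1 - vhdf
print(c(MHDF = sim_mhdf, VHDF = sim_vhdf))
```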
The similarity is:
group_sim = exp(-HDF(group1, group2))
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1186/1471-2105-8-166.
Similar to VHDF, but it directly uses the similarity:
group_sim = 0.5*(sqrt(mean_{a in group1}(S(a, group2)^2)) + sqrt(mean_{b in group2}(S(b, group1)^2)))
The term semantic similarity method and the IC method can be set via the control argument:
group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014",
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation"))
Other parameters for the term_sim_method can also be set in the control list.
Paper link: doi:10.1109/TCBB.2014.2343963.
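Because this score applies the root-mean-square to similarities rather than distances, it is not simply one minus VHDF. A short base R sketch on a made-up similarity matrix, following the formula above:

```r
# Toy similarity matrix (made-up values): rows = group1, columns = group2.
S = matrix(c(0.2, 0.8, 0.4,
             0.6, 0.1, 0.5),
           nrow = 2, byrow = TRUE)

best_1to2 = apply(S, 1, max)  # S(a, group2)
best_2to1 = apply(S, 2, max)  # S(b, group1)

# Average of the two root-mean-square best-match similarities.
sim_rms = 0.5 * (sqrt(mean(best_1to2^2)) + sqrt(mean(best_2to1^2)))
print(sim_rms)
```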
It is based on the average distance between every pair of terms in the two groups:
exp(-mean_{a in group1, b in group2}(d(a, b)))
d(a, b) is the distance between a and b, which can be the shortest distance between the two terms or the longest distance via LCA.
There is a parameter distance which takes the value "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":
group_sim(dag, group1, group2, method = "GroupSim_SimALN",
    control = list(distance = "shortest_distances_via_NCA"))
Paper link: doi:10.1109/ISCC.2008.4625763.
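To show how the exponential maps an average distance to a similarity in (0, 1], here is a minimal base R sketch. The pairwise distances in d are made up; in practice they would be path lengths in the DAG under whichever distance option is chosen.

```r
# Toy pairwise distances d(a, b) (made-up values):
# rows = terms in group1, columns = terms in group2.
d = matrix(c(2, 3,
             4, 3),
           nrow = 2, byrow = TRUE)

# Larger average distance gives a similarity closer to zero.
sim_dist = exp(-mean(d))
print(sim_dist)
```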
Denote A and B as the two sets of ancestor terms of the terms in group1 and group2 respectively. The SimGIC is:
group_sim = sum_{x in intersect(A, B)}(IC(x))/sum_{x in union(A, B)}(IC(x))
The IC method can be set via control = list(IC_method = ...).
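SimGIC is an IC-weighted Jaccard index over the two ancestor sets. The base R sketch below uses made-up IC values and made-up ancestor sets A and B purely for illustration; real values would come from the DAG and the chosen IC_method.

```r
# Made-up information content per term (hypothetical term names).
ic = c(root = 0, t1 = 1, t2 = 2, t3 = 3, t4 = 4)

# Made-up ancestor sets of group1 and group2.
A = c("root", "t1", "t2", "t3")
B = c("root", "t1", "t3", "t4")

# IC summed over shared ancestors, divided by IC summed over all ancestors.
sim_gic = sum(ic[intersect(A, B)]) / sum(ic[union(A, B)])
print(sim_gic)
```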
Similar to GroupSim_SimGIC, but it calculates the Dice coefficient:
group_sim = 2*sum_{x in intersect(A, B)}(IC(x))/(sum_{x in A}(IC(x)) + sum_{x in B}(IC(x)))
The IC method can be set via control = list(IC_method = ...).
Similar to GroupSim_SimGIC, it is calculated as:
group_sim = sum_{x in intersect(A, B)}(IC(x))/max(sum_{x in A}(IC(x)), sum_{x in B}(IC(x)))
The IC method can be set via control = list(IC_method = ...).
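The Dice-style and max-normalized variants differ from SimGIC only in the denominator. A small base R sketch with made-up IC values and ancestor sets, following the two formulas above:

```r
# Made-up information content per term (hypothetical term names).
ic = c(root = 0, t1 = 1, t2 = 2, t3 = 3, t4 = 4)

# Made-up ancestor sets of group1 and group2.
A = c("root", "t1", "t2", "t3")
B = c("root", "t1", "t3", "t4")

shared = sum(ic[intersect(A, B)])  # IC of the shared ancestors
total_A = sum(ic[A])
total_B = sum(ic[B])

sim_dice_ic = 2 * shared / (total_A + total_B)  # Dice-style denominator
sim_max_ic = shared / max(total_A, total_B)     # normalized by the larger set
print(c(dice = sim_dice_ic, max_norm = sim_max_ic))
```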
It is only based on the number of terms. A is the set of all ancestors of group1 terms and B is the set of all ancestors of group2 terms.
group_sim = length(intersect(A, B))/length(union(A, B))
It is based on the dot product of two vectors p and q which correspond to terms in group1 and group2.
p and q have the same length as the total number of terms. The value at position i in p or q corresponds to term t. The value is IC(t) if t is an ancestor of any term in the corresponding group (group1 for p, group2 for q), and zero otherwise. The similarity between group1 terms and group2 terms is calculated as:
<p,q>/||p||/||q||
where <p,q> is the dot product between the two vectors, and ||p|| or ||q|| is the norm of the vector.
The equation can be written as:
group_sim = sum_{x in intersect(A, B)}(IC(x)^2) /
    sqrt(sum_{x in A}(IC(x)^2)) /
    sqrt(sum_{x in B}(IC(x)^2))
The IC method can be set via control = list(IC_method = ...).
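The equivalence between the vector form and the set form can be verified numerically. The sketch below is plain base R with made-up IC values and ancestor sets; the names p, q, A and B follow the text above.

```r
# Made-up information content per term (hypothetical term names).
ic = c(root = 0, t1 = 1, t2 = 2, t3 = 3, t4 = 4)

# Made-up ancestor sets of group1 and group2.
A = c("root", "t1", "t2", "t3")
B = c("root", "t1", "t3", "t4")

# Vector form: IC(t) where t is an ancestor of the group, zero elsewhere.
p = ifelse(names(ic) %in% A, ic, 0)
q = ifelse(names(ic) %in% B, ic, 0)
sim_cos_vec = sum(p * q) / sqrt(sum(p^2)) / sqrt(sum(q^2))

# Set form: the same quantity written with sums over A, B and their intersection.
sim_cos_set = sum(ic[intersect(A, B)]^2) /
    sqrt(sum(ic[A]^2)) /
    sqrt(sum(ic[B]^2))
print(c(vector_form = sim_cos_vec, set_form = sim_cos_set))
```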
Similar to GroupSim_SimCOU, the similarity is:
<p,q>/(||p||^2 + ||q||^2 - <p,q>)
And it can be rewritten as:
group_sim = sum_{x in intersect(A, B)}(IC(x)^2) /
    (sum_{x in A}(IC(x)^2) + sum_{x in B}(IC(x)^2) - sum_{x in intersect(A, B)}(IC(x)^2))
The IC method can be set via control = list(IC_method = ...).
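This score has the form of a Tanimoto coefficient on the IC vectors. A minimal base R sketch with made-up IC values and ancestor sets, mirroring the rewritten formula above:

```r
# Made-up information content per term (hypothetical term names).
ic = c(root = 0, t1 = 1, t2 = 2, t3 = 3, t4 = 4)

# Made-up ancestor sets of group1 and group2.
A = c("root", "t1", "t2", "t3")
B = c("root", "t1", "t3", "t4")

dot_pq = sum(ic[intersect(A, B)]^2)  # <p,q>
norm2_p = sum(ic[A]^2)               # ||p||^2
norm2_q = sum(ic[B]^2)               # ||q||^2

sim_tanimoto = dot_pq / (norm2_p + norm2_q - dot_pq)
print(sim_tanimoto)
```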
It is the largest depth among the terms in intersect(A, B).
group_sim = max(depth(intersect(A, B)))
It is a normalized version of GroupSim_SimLP:
group_sim = max(depth(intersect(A, B)))/max_depth
No offset is subtracted in the normalization, since the minimal depth is zero for the root.
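The depth-based score and its normalized version can be sketched in base R as follows. The depths and ancestor sets are made up, and max_depth is assumed here to be the largest depth of any term in the DAG, which the text above does not state explicitly.

```r
# Made-up depth of every term in a toy DAG (hypothetical term names).
depth = c(root = 0, t1 = 1, t2 = 2, t3 = 2, t4 = 3)

# Made-up ancestor sets of group1 and group2.
A = c("root", "t1", "t2", "t3")
B = c("root", "t1", "t3", "t4")

# Depth of the deepest shared ancestor.
sim_depth = max(depth[intersect(A, B)])

# Assumption: max_depth is the largest depth in the whole DAG.
max_depth = max(depth)
sim_depth_norm = sim_depth / max_depth
print(c(raw = sim_depth, normalized = sim_depth_norm))
```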
It is based on the annotated items. Denote sigma(t) as the total number of annotated items of t. The similarity is calculated as
group_sim = log(C/sigma_max)/log(sigma_min/sigma_max)
where C is min_{x in intersect(A, B)}(sigma(x)), i.e., the minimal sigma in the intersection of group1 and group2. Note that here A and B are just the two sets of terms in group1 and group2. sigma_max is the total number of items annotated to the DAG, and sigma_min is the minimal number of items annotated to a term, which is mostly 1.
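The log ratio above equals 1 when the most specific shared term is annotated with sigma_min items and 0 when it is annotated with all items. A base R sketch with made-up annotation counts (the values of sigma, sigma_max and sigma_min are all hypothetical):

```r
# Made-up number of annotated items per term (hypothetical term names).
sigma = c(t1 = 8, t2 = 4, t3 = 2, t4 = 1)
sigma_max = 16  # made-up total number of items annotated to the DAG
sigma_min = 1   # minimal number of items annotated to a term

# Here A and B are the terms of group1 and group2 themselves.
A = c("t1", "t2", "t3")
B = c("t2", "t3", "t4")

# C: the smallest annotation count among the shared terms.
C = min(sigma[intersect(A, B)])
sim_annot = log(C / sigma_max) / log(sigma_min / sigma_max)
print(sim_annot)
```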
A and B are just the two sets of terms in group1 and group2. The similarity is calculated as:
group_sim = max_{t in intersect(A, B)}(1 - sigma(t)/N)
Say A is the set of items annotated to terms in group1 and B is the set of items annotated to terms in group2.
This is the Jaccard coefficient between the two sets.
The universe/background can be set via control = list(universe = ...).
It is the Dice coefficient between A and B.
The universe/background can be set via control = list(universe = ...).
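Both item-based coefficients reduce to simple set operations. The base R sketch below uses made-up item sets and does not model the universe option; it only illustrates the standard Jaccard and Dice definitions.

```r
# Made-up sets of items annotated to group1 and group2 (hypothetical item IDs).
A = c("i1", "i2", "i3", "i4")
B = c("i3", "i4", "i5")

n_shared = length(intersect(A, B))

sim_jaccard = n_shared / length(union(A, B))
sim_dice = 2 * n_shared / (length(A) + length(B))
print(c(jaccard = sim_jaccard, dice = sim_dice))
```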
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
group_sim(dag, c("c", "e"), c("d", "f"),
    method = "GroupSim_pairwise_avg",
    control = list(term_sim_method = "Sim_Lin_1998")
)
#> group_sim_method: GroupSim_pairwise_avg
#> term_sim_method: Sim_Lin_1998
#> IC_method: IC_annotation
#> collecting all ancestors of input terms ...
#>
#> going through 0 / 6 ancestors ...
#>
#> going through 6 / 6 ancestors ... Done.
#> group2
#> group1 0.2421052