Semantic similarity between two groups of terms

group_sim(
  dag,
  group1,
  group2,
  method,
  control = list(),
  verbose = simona_opt$verbose
)

Arguments

dag: An ontology_DAG object.
group1: A vector of term names or a list of term vectors.
group2: A vector of term names or a list of term vectors..
method: A group similarity method. All available methods are in all_group_sim_methods().
control: A list of parameters passing to individual methods. The term similarity method is controlled by term_sim_method and the IC method is controlled by IC_method. Other term similarity related parameters can also be specified in control. See the subsections.
verbose: Whether to print messages.

Value

A numeric scalar, a numeric vector or a matrix depending on the dat type of group1 and group2.

Details

If annotation is set in create_ontology_DAG() and you want to directly calculate semantic similarity between two annotated items, you can first get the associated terms of the two items by annotated_terms():

group1 = annotated_terms(dag, item1)[[1]]
group2 = annotated_terms(dag, item2)[[1]]
group_sim(dag, group1, group2, ...)

Methods

GroupSim_pairwise_avg

Denote S(a, b) as the semantic similarity between terms a and b where a is from group1 and b is from group2, The similarity between group1 and group2 is the average similarity of every pair of individual terms in the two groups:

group_sim = mean_{a in group1, b in group2}(S(a, b))

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_avg"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Pape link: doi:10.1093/bioinformatics/btg153 .

GroupSim_pairwise_max

This is the maximal S(a, b) among all pairs of terms in group1 and group2:

group_sim = max_{a in group1, b in group2}(S(a, b))

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_max"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1109/TCBB.2005.50 .

GroupSim_pairwise_BMA

BMA stands for "best-match average". First define similarity of a term to a group of terms as

S(x, group) = max_{y in group}(x, y)

which is the most similar terms in group to x.

Then the BMA similarity is calculated as:

group_sim = 0.5*(mean_{a in group1}(S(a, group2)) + mean_{b in group2}(S(b, group1)))

So it is the average of the similarity of every term in group1 to the whole group2 and every term in group2 to the whole group1.

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMA"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1155/2012/975783 .

GroupSim_pairwise_BMM

BMM stands for "best-match max". It is defined as:

group_sim = max(mean_{a in group1}(S(a, group2)), mean_{b in group2}(S(b, group1)))

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_BMM"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1186/1471-2105-7-302 .

GroupSim_pairwise_ABM

ABM stands for "average best-match". It is defined as:

group_sim = (sum_{a in group1}(S(a, group2)) + sum_{b in group2}(S(b, group1)))/(n1 + n2)

where n1 and n2 are the number of terms in group1 and group2.

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_ABM"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1186/1471-2105-14-284 .

GroupSim_pairwise_HDF

First define the distance of a term to a group of terms:

D(x, group) = 1 - S(x, group)

Then the Hausdorff distance between two groups are:

HDF = max(max_{a in group1}(D(a, group2)), max_{b in group2}(D(b, group1)))

This final similarity is:

group_sim = 1 - HDF

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_HDF"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

GroupSim_pairwise_MHDF

Instead of using the maximal distance from a group to the other group, MHDF uses mean distance:

MHDF = max(mean_{a in group1}(D(a, group2)), mean_{b in group2}(D(b, group1)))

This final similarity is:

group_sim = 1 - MHDF

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_MHDF"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1109/ICPR.1994.576361 .

GroupSim_pairwise_VHDF

It is defined as:

VHDF = 0.5*(sqrt(mean_{a in group1}(D(a, group2)^2)) + sqrt(mean_{b in group2}(D(b, group1)^2)))
group_sim = 1 - VHDF

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_VHDF"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1073/pnas.0702965104 .

GroupSim_pairwise_Froehlich_2007

The similarity is:

group_sim = exp(-HDF(group1, group2))

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_Froehlich_2007"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1186/1471-2105-8-166 .

GroupSim_pairwise_Joeng_2014

Similar to VHDF, but it directly uses the similarity:

group_sim = 0.5*(sqrt(mean_{a in group1}(S(a, group2)^2)) + sqrt(mean_{b in group2}(S(b, group1)^2)))

The term semantic similarity method and the IC method can be set via control argument:

group_sim(dag, group1, group2, method = "GroupSim_pairwise_Joeng_2014"
    control = list(term_sim_method = "Sim_Lin_1998", IC_method = "IC_annotation")`.

Other parameters for the term_sim_method can also be set in the control list.

Paper link: doi:10.1109/TCBB.2014.2343963 .

GroupSim_SimALN

It is based on the average distances between every pair of terms in the two groups:

exp(-mean_{a in group1, b in group2}(d(a, b)))

d(a, b) is the distance between a and b, which can be the shortest distance between the two terms or the longest distnace via LCA.

There is a parameter distance which takes value of "longest_distances_via_LCA" (the default) or "shortest_distances_via_NCA":

group_sim(dag, group1, group2, method = "GroupSim_SimALN",
    control = list(distance = "shortest_distances_via_NCA"))

Paper link: doi:10.1109/ISCC.2008.4625763 .

GroupSim_SimGIC

Denote A and B as the two sets of ancestors terms of terms in group1 and group2 respectively, the SimGIC is:

group_sim = sum_{x in intersect(A, B)}(IC(x))/sum_{x in union(A, B)}(IC(x))

IC method can be set via control = list(IC_method = ...).

GroupSim_SimDIC

Similar as GroupSim_SimGIC, it calculates the Dice coeffcient:

group_sim = 2*sum_{x in intersect(A, B)}(IC(x))/(sum_{x in A}(IC(x)) + sum_{x in B}(IC(x)))

IC method can be set via control = list(IC_method = ...).

GroupSim_SimUIC

Similar as GroupSim_SimGIC, it is calculated as:

group_sim = sum_{x in intersect(A, B)}(IC(x))/max(sum_{x in A}(IC(x)), sum_{x in B}(IC(x)))

IC method can be set via control = list(IC_method = ...).

GroupSim_SimUI

It is only based on the number of terms. A is the set of all ancestors of group1 terms and B is the set of all ancestors of group2 terms.

group_sim = length(intersect(A, B))/length(union(A, B))

GroupSim_SimDB

It is:

group_sim = 2*length(intersect(A, B))/(length(A) + length(B))

GroupSim_SimUB

It is:

group_sim = length(intersect(A, B))/max(length(A), length(B))

GroupSim_SimNTO

It is:

group_sim = length(intersect(A, B))/min(length(A), length(B))

GroupSim_SimCOU

It is based on the dot product of two vectors p and q which correspond to terms in group1 and group2. p and q have the same length as the total number of terms. Value of position i in p or q corresponds to term t. The value takes IC(t) if t is an ancestor of any term in p or q, and the value takes zero if t is not. The similarity betweem group1 terms and group2 terms is calculated as:

<p,q>/||p||/||q||

where <p,q> is the dot product between the two, and ||p|| or ||q|| is the norm of the vector. The equation can be written as:

group_sim = sum_{x in intersect(A, B)}(IC(x)^2) / 
              sqrt(sum_{x in A}(IC(x)^2)) / 
              sqrt(sum_{x in B}(IC(x)^2))

IC method can be set via control = list(IC_method = ...).

GroupSim_SimCOT

Similar as GroupSim_SimCOU, the similarity is:

<p,q>/(||p||^2 + ||q||^2 - <p,q>)

And it can be rewritten as:

group_sim = sum_{x in intersect(A, B)}(IC(x)^2) /
    (sum_{x in A}(IC(x)^2) + sum_{x in B}(IC(x)^2) - sum_{x in intersect(A, B)}(IC(x)^2))

IC method can be set via control = list(IC_method = ...).

GroupSim_SimLP

It is the longest depth for the terms in intersect(A, B).

group_sim = max(depth(intersect(A, B)))

GroupSim_Ye_2005

It is a normalized version of GroupSim_SimLP:

group_sim = max(depth(intersect(A, B)))/max_depth

Since the minimal depth is zero for root.

GroupSim_SimCHO

It is based on the annotated items. Denote sigma(t) as the total annotated items of t. The similarity is calculated as

group_sim = log(C/sigma_max)/log(sigma_min/sigma_max)

where C is min(sigma_{x in intersect(A, B)}(x)), i.e., the minimal sigma in the intersection of group1 and group2. Note Now A and B are just two sets of terms in group1 and group2. sigma_max is the total number of items annotated to the DAG, sig_min is the minimal number of items annotated to a term, which is mostly 1.

GroupSim_SimALD

A and B are just two sets of terms in group1 and group2. The similarity is calculated as:

group_sim = max_{t in intersect(A, B)}(1 - sigma(t)/N)

GroupSim_Jaccard

Say A is the set of items annotated to terms in group1 and B is the set of items annotated to group2. This is the Jaccard coeffcient between two sets.

The universe/background can be set via control = list(universe = ...).

GroupSim_Dice

It is the Dice coeffcient between A and B.

The universe/background can be set via control = list(universe = ...).

GroupSim_Overlap

It is the Overlap coeffcient between A and B.

The universe/background can be set via control = list(universe = ...).

GroupSim_Kappa

The universe/background can be set via control = list(universe = ...).

Examples

parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
group_sim(dag, c("c", "e"), c("d", "f"), 
    method = "GroupSim_pairwise_avg", 
    control = list(term_sim_method = "Sim_Lin_1998")
)
#> group_sim_method: GroupSim_pairwise_avg
#> term_sim_method: Sim_Lin_1998
#> IC_method: IC_annotation
#> collecting all ancestors of input terms ...
#> 
#> going through 0 / 6 ancestors ...
#> 
#> going through 6 / 6 ancestors ... Done.
#>           group2
#> group1 0.2421052