Information content
term_IC(
dag,
method,
terms = NULL,
control = list(),
verbose = simona_opt$verbose
)
dag: An ontology_DAG object.
method: An IC method. All available methods are in all_term_IC_methods().
terms: A vector of term names. If it is set, the returned vector is subsetted to these terms.
control: A list of parameters passed to individual methods. See the subsections.
verbose: Whether to print messages.
A numeric vector.
Denote k as the number of offspring terms plus the term itself, and N as the same value for the root (i.e. the total number of terms in the DAG). The information content is calculated as:
IC = -log(k/N)
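As an illustration only, the count k can be obtained in base R by walking the child links of a small DAG. This is a minimal sketch, not the package's implementation: the edge list reuses the one from the example at the end of this page, and the helper offspring() is hypothetical.

```r
# Toy DAG given as parent -> child edges (plain vectors, not an ontology_DAG object)
parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
terms = sort(unique(c(parents, children)))

# Hypothetical helper: collect all offspring of a term by following child links
offspring = function(term) {
    ch = children[parents == term]
    unique(c(ch, unlist(lapply(ch, offspring))))
}

# k: number of offspring terms plus the term itself
k = sapply(terms, function(x) length(offspring(x)) + 1)
N = k[["a"]]          # value for the root, here the total number of terms
ic = -log(k / N)
round(ic, 4)
```

In this sketch the root gets an IC of 0 and the leaves get the largest value, log(N).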
For a term t
in the DAG, denote d
as the maximal distance from root (i.e. the depth) and h
as the maximal distance to leaves (i.e. the height),
the relative position p
on the longest path from root to leaves via term t
is calculated as:
p = (h + 1)/(h + d + 1)
Adding 1 in the formula avoids p = 0. Then the information content is:
IC = -log(p)
   = -log((h+1)/(h+d+1))
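A short worked example of the arithmetic, using made-up depth and height values rather than values from a real ontology:

```r
# Hypothetical term: depth d = 2 (longest distance from root),
# height h = 1 (longest distance to a leaf)
d = 2
h = 1
p = (h + 1) / (h + d + 1)   # relative position: 2/4 = 0.5
ic = -log(p)                # equals log(2)

# For a leaf (h = 0) the added 1 keeps p strictly positive
p_leaf = (0 + 1) / (0 + d + 1)

# For the root (d = 0) p is 1, so the IC is 0
p_root = (h + 1) / (h + 0 + 1)
ic_root = -log(p_root)
```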
Denote k as the number of items annotated to a term t, and N as the number of items annotated to the root (which is the total number of items annotated to the DAG). The IC for term t is calculated as:
IC = -log(k/N)
Existing implementations in other tools are inconsistent in how they define k and N. Please see n_annotations() for an explanation.
NA is assigned to terms with no annotated items.
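The following base R sketch shows one way to count k and N. It assumes that k counts the unique items annotated to a term or to any of its offspring; this is an assumption for illustration, although it does reproduce the output of the example at the end of this page. The helper offspring() is hypothetical.

```r
# Same toy DAG and annotation as in the example at the end of this page
parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    a = c("t1", "t2", "t3"), b = c("t3", "t4"), c = "t5",
    d = "t7", e = c("t4", "t5", "t6", "t7"), f = "t8"
)

# Hypothetical helper: all offspring of a term
offspring = function(term) {
    ch = children[parents == term]
    unique(c(ch, unlist(lapply(ch, offspring))))
}

# k: unique items annotated to the term or to any of its offspring (assumption)
k = sapply(names(annotation), function(x) {
    length(unique(unlist(annotation[c(x, offspring(x))])))
})
N = k[["a"]]                          # items annotated to the root
ic = ifelse(k == 0, NA, -log(k / N))  # NA when nothing is annotated
round(ic, 7)
```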
It measures the probability of a term getting full transmission from the root. Each term is associated with a p-value and the root has the p-value of 1.
For example, suppose an intermediate term t has two parent terms parent1 and parent2, where parent1 has k1 children and parent2 has k2 children. Assuming a parent transmits information equally to all its children, parent1 transmits only 1/k1 and parent2 only 1/k2 of its content to term t. Equivalently, the probability of a parent reaching t is 1/k1 or 1/k2, respectively.
Let p1 and p2 be the accumulated contents from the root to parent1 and parent2 respectively (i.e. the probabilities of the two parent terms getting full transmission from the root). Then the probability of reaching t via a full transmission graph through parent1 is the product of p1 and 1/k1, i.e. p1/k1, and similarly p2/k2 through parent2. If the transmissions from parent1 and parent2 are independent, the probability of t (denoted as p_t) being transmitted from both parents is:
p_t = (p1/k1) * (p2/k2)
Since the two parents are the full set of t's parents, p_t is the probability of t getting full transmission from the root. The final information content is:
IC = -log(p_t)
Paper link: doi:10.1155/2012/975783 .
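The recursion described above can be sketched in base R as follows. This is an illustrative sketch on a toy edge list, not the package's implementation; it assumes the terms are visited in topological order so that every parent's probability is known before its children are processed.

```r
# Toy DAG given as parent -> child edges
parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
terms = c("a", "b", "c", "d", "e", "f")   # already in topological order

n_children = table(parents)   # number of children of each parent (k1, k2, ...)
p = c(a = 1)                  # the root has probability 1

for (x in terms[-1]) {
    pa = parents[children == x]   # all parents of x
    # product over parents of (parent probability / parent's number of children)
    p[x] = prod(p[pa] / as.numeric(n_children[pa]))
}
ic = -log(p)
```

Here term c has parents a and b, each with two children, so p_c = (1/2) * ((1/2)/2) = 1/8.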
It measures the number of ways from a term to reach leaf terms. E.g. in the following DAG:
      a      upstream
     /|\
    b | c
      |/
      d      downstream
term a has three ways to reach a leaf: a->b, a->d and a->c->d.
Denote k as the number of ways for term t to reach leaves, and N as the maximal value of k, which is associated with the root term. The information content is calculated as:
IC = -log(k/N)
   = log(N) - log(k)
Paper link: doi:10.1186/1471-2105-7-135 .
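The counting can be sketched in base R for the small DAG drawn above (edges a->b, a->d, a->c and c->d). The helper n_ways() is hypothetical, and treating a leaf as having exactly one way to reach a leaf is an assumption made for this sketch.

```r
# Edges of the example DAG: a->b, a->d, a->c, c->d
parents  = c("a", "a", "a", "c")
children = c("b", "d", "c", "d")
terms = c("a", "b", "c", "d")

# Hypothetical helper: number of distinct paths from a term down to any leaf
n_ways = function(term) {
    ch = children[parents == term]
    if (length(ch) == 0) return(1)   # assumption: a leaf counts as one way
    sum(sapply(ch, n_ways))
}

k = sapply(terms, n_ways)
N = k[["a"]]               # maximal value, reached at the root
ic = log(N) - log(k)
```

For this DAG, k is 3 for a and 1 for every other term.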
It is based on the number of offspring terms of term t
.
The information content is calculated as:
IC = 1 - log(k+1)/log(N)
where k is the number of offspring terms of t, so k+1 is the number of t's offspring terms plus t itself. N is the total number of terms in the DAG.
Paper link: doi:10.5555/3000001.3000272 .
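A small numeric illustration of the formula, using made-up offspring counts for a six-term DAG:

```r
# Hypothetical numbers of offspring terms for six terms (a is the root)
n_offspring = c(a = 5, b = 4, c = 1, d = 1, e = 0, f = 0)
N = length(n_offspring)   # total number of terms in the DAG

ic = 1 - log(n_offspring + 1) / log(N)
round(ic, 4)
```

With this definition the values lie between 0 (the root, where k + 1 = N) and 1 (leaves, where k = 0).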
It is a correction of IC_Seco_2004 which considers the depth of a term in the DAG. The information content is calculated as:
IC = 0.5*IC_Seco + 0.5*log(depth)/log(max_depth)
where depth
is the depth of term t
in the DAG, defined as the maximal distance from root. max_depth
is the largest depth in the DAG.
So the IC is composed of two parts: the number of offspring terms and the position of the term in the DAG.
Paper link: doi:10.1109/FGCNS.2008.16 .
It is also a correction to IC_Seco_2004, but it considers the number of relations connected to a term (i.e. the number of parent terms and child terms). The information content is defined as:
IC = (1-sigma)*IC_Seco + sigma*log(n_parents + n_children + 1)/log(total_edges + 1)
where n_parents
and n_children
are the numbers of parents and children of term t
. The tuning factor sigma
is defined as
sigma = log(total_edges+1)/(log(total_edges) + log(total_terms))
where total_edges
is the number of all relations (all parent-child relations)
and total_terms
is the number of all terms in the DAG.
Paper link: doi:10.5555/1862330.1862343 .
It measures the average contribution of term t on leaf terms. First denote zeta as the number of leaf terms that can be reached from term t (i.e. t's offspring that are leaves). Since all of t's ancestors can also reach t's leaves, the contribution of t on leaf terms is scaled by n_ancestors, which is the number of t's ancestor terms.
The final information content is normalized by the total number of leaves in the DAG, which is the maximal possible value of zeta.
The complete definition of information content is:
IC = -log( (zeta/n_ancestors) / n_all_leaves)
Paper link: doi:10.1016/j.knosys.2010.10.001 .
It has a more complex form which takes into account both the depth of the term and its downstream terms. The first factor is calculated as:
f1 = log(depth)/log(max_depth)
The second factor is calculated as:
f2 = 1 - log(1 + sum_{x in t's offspring}(1/depth_x))/log(total_terms)
In the equation, the summation goes over t
's offspring terms.
The final information content is the multiplication of f1
and f2
:
IC = f1 * f2
Paper link: http://article.nadiapub.com/IJGDC/vol5_no3/6.pdf.
There is one parameter correct
. If it is set to TRUE
, the first factor f1
is calculated as:
f1 = log(depth + 1)/log(max_depth + 1)
correct
can be set as:
term_IC(dag, method = "IC_Meng_2012", control = list(correct = TRUE))
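To see the numeric effect of the two forms of f1, the following base R sketch evaluates both on a few made-up depth values. It only evaluates the formulas as written above and does not reproduce how the package treats special cases.

```r
# Hypothetical depths; the root has depth 0 by the definition of depth
depth = c(root = 0, t1 = 1, t2 = 2, t3 = 3)
max_depth = max(depth)

# Uncorrected form: log(0) is -Inf for the root and log(1) is 0 at depth 1
f1_default = log(depth) / log(max_depth)

# Corrected form: both numerator and denominator are shifted by 1
f1_correct = log(depth + 1) / log(max_depth + 1)

round(rbind(f1_default, f1_correct), 4)
```

In the corrected form the root gets f1 = 0 and terms at depth 1 get a positive value.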
Each relation is weighted by a value less than 1 based on the semantic relation, i.e. 0.8 for "is_a" and 0.6 for "part_of".
For a term t and one of its ancestor terms a, it first calculates an "S-value", which corresponds to the path from a to t where the accumulated product of weights along the path is maximal:
S(a->t) = max_{path}(prod_{edge on the path}(w))
Here max
goes over all possible paths from a
to t
, and prod()
multiplies edge weights in a certain path.
The formula can be transformed as (we simply rewrite S(a->t)
to S
):
1/S = min(prod(1/w))
log(1/S) = log(min(prod(1/w)))
= min(sum(log(1/w)))
Since w < 1
, log(1/w)
is positive. According to the equation, the path (a->...->t
) is actually the shortest path from a
to t
by taking
log(1/w)
as the weight, and log(1/S)
is the weighted shortest distance.
If S(a->t) is thought of as the maximal semantic contribution from a to t, the information content is calculated as the sum over all of t's ancestors (including t itself):
IC = sum_{a in t's ancestors + t}(S(a->t))
Paper link: doi:10.1093/bioinformatics/btm087 .
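The S-value and the resulting IC can be sketched in base R on a toy DAG. This is an illustration only and not the package's implementation: the edge list, the relation types assigned to the edges and the helper s_value() are all hypothetical, while the weights 0.8 and 0.6 are the ones mentioned above.

```r
# Toy DAG: a->b (is_a), a->c (part_of), b->d (is_a), c->d (is_a)
parents  = c("a", "a", "b", "c")
children = c("b", "c", "d", "d")
w        = c(0.8, 0.6, 0.8, 0.8)   # edge weights by relation type

# Hypothetical helper: largest product of edge weights over all paths
# from `from` down to `to`; returns 0 when `to` cannot be reached
s_value = function(from, to) {
    if (from == to) return(1)
    idx = which(parents == from)
    if (length(idx) == 0) return(0)
    max(sapply(idx, function(i) w[i] * s_value(children[i], to)))
}

terms = c("a", "b", "c", "d")
s_to_d = sapply(terms, s_value, to = "d")   # S(x->d) for every term x
ic_d = sum(s_to_d[s_to_d > 0])              # sum over d's ancestors and d itself

# Same S(a->d) via the weighted shortest distance with log(1/w) as edge weight
d_via_b = sum(log(1 / c(0.8, 0.8)))
d_via_c = sum(log(1 / c(0.6, 0.8)))
s_from_distance = exp(-min(d_via_b, d_via_c))
```

Here the path a->b->d (0.8 * 0.8 = 0.64) beats a->c->d (0.6 * 0.8 = 0.48), so S(a->d) = 0.64 and the IC of d is 1 + 0.8 + 0.8 + 0.64 = 3.24.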
The contribution of different semantic relations can be set with the contribution_factor parameter. The value should be a named numeric vector whose names cover the relations defined in relations, as set in create_ontology_DAG(). For example, if there are two relations "relation_a" and "relation_b" set in the DAG, the value for contribution_factor can be set as:
term_IC(dag, method = "IC_Wang_2007",
    control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))
Note the IC_Wang_2007 method is normally used within the Sim_Wang_2007 semantic similarity method.
parents = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
"a" = c("t1", "t2", "t3"),
"b" = c("t3", "t4"),
"c" = "t5",
"d" = "t7",
"e" = c("t4", "t5", "t6", "t7"),
"f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_IC(dag, "IC_annotation")
#> IC_method: IC_annotation
#> a b c d e f
#> 0.0000000 0.2876821 0.6931472 1.3862944 0.6931472 2.0794415