Information content

term_IC(
  dag,
  method,
  terms = NULL,
  control = list(),
  verbose = simona_opt$verbose
)

Arguments

dag: An ontology_DAG object.
method: An IC method. All available methods are in all_term_IC_methods().
terms: A vector of term names. If it is set, the returned vector is restricted to these terms.
control: A list of parameters passed to individual methods. See the subsections.
verbose: Whether to print messages.

Value

A numeric vector.

Methods

IC_offspring

Denote k as the number of offspring terms plus the term itself, and N as the same value for the root (i.e. the total number of terms in the DAG). The information content is calculated as:

IC = -log(k/N)
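
As a rough illustration of the formula above, the following base R sketch counts offspring on a small toy DAG (the same edges as in the Examples section). The helper offspring() and the variable names are hypothetical and are not part of the package; this is not how term_IC() is implemented internally.

```r
# Toy DAG as parent -> child edges (same edges as the Examples section)
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
terms <- sort(unique(c(parents, children)))

# Hypothetical helper: all offspring of a term, found by walking child links
offspring <- function(t) {
  kids <- children[parents == t]
  unique(c(kids, unlist(lapply(kids, offspring))))
}

# k = number of offspring plus the term itself; N = k of the root "a"
k <- sapply(terms, function(t) length(offspring(t)) + 1)
N <- k[["a"]]
ic_offspring <- -log(k / N)
print(round(ic_offspring, 3))
```

The root gets IC = 0 because k = N, and leaf terms get the largest value, log(N).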

IC_height

For a term t in the DAG, denote d as the maximal distance from the root (i.e. the depth) and h as the maximal distance to leaves (i.e. the height). The relative position p on the longest path from root to leaves via term t is calculated as:

p = (h + 1)/(h + d + 1)

Adding 1 in the formula avoids p = 0, which would otherwise occur for leaf terms where h = 0. Then the information content is:

IC = -log(p) 
   = -log((h+1)/(h+d+1))
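
The depth and height definitions can be sketched in base R on a toy DAG as follows. The helpers depth() and height() are hypothetical and only mirror the definitions in the text (longest distance from the root and to the leaves); they are not the package's own functions.

```r
# Toy DAG as parent -> child edges (same edges as the Examples section)
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
terms <- sort(unique(c(parents, children)))

# Depth: longest distance from the root (a term without parents has depth 0)
depth <- function(t) {
  p <- parents[children == t]
  if (length(p) == 0) 0 else 1 + max(sapply(p, depth))
}

# Height: longest distance to a leaf (a term without children has height 0)
height <- function(t) {
  ch <- children[parents == t]
  if (length(ch) == 0) 0 else 1 + max(sapply(ch, height))
}

d <- sapply(terms, depth)
h <- sapply(terms, height)
p <- (h + 1) / (h + d + 1)   # relative position on the longest path
ic_height <- -log(p)
print(round(ic_height, 3))
```

For the root, d = 0 and so p = 1 and IC = 0. For the leaves e and f, h = 0 and d = 3, so p = 1/4.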

IC_annotation

Denote k as the number of items annotated to a term t, and N as the number of items annotated to the root (which is the total number of items annotated to the DAG). The IC for term t is calculated as:

IC = -log(k/N)

Other tools currently define k and N inconsistently. Please see n_annotations() for an explanation.

NA is assigned to terms with no item annotated.
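
The following base R sketch recomputes k and N by hand for the toy data used in the Examples section. It assumes that items annotated to offspring terms are merged into their ancestors and that each item is counted once per term; the helper items() is hypothetical.

```r
# Toy DAG and annotation (same data as the Examples section)
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
terms <- sort(unique(c(parents, children)))
annotation <- list(
  a = c("t1", "t2", "t3"), b = c("t3", "t4"), c = "t5",
  d = "t7", e = c("t4", "t5", "t6", "t7"), f = "t8"
)

# Hypothetical helper: unique items of a term merged with those of its offspring
items <- function(t) {
  kids <- children[parents == t]
  unique(c(annotation[[t]], unlist(lapply(kids, items))))
}

k <- sapply(terms, function(t) length(items(t)))
N <- k[["a"]]                 # items annotated to the root
ic_annotation <- -log(k / N)
print(round(ic_annotation, 7))
```

Under these assumptions the values agree with the output of term_IC(dag, "IC_annotation") shown in the Examples section.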

IC_universal

It measures the probability of a term getting full transmission from the root. Each term is associated with a p-value and the root has the p-value of 1.

For example, assume an intermediate term t has two parent terms parent1 and parent2, where parent1 has k1 children and parent2 has k2 children. If a parent transmits information equally to all its children, then parent1 transmits only 1/k1 and parent2 only 1/k2 of its content to term t; in other words, the probability of a parent reaching t is 1/k1 or 1/k2. Let p1 and p2 be the accumulated contents from the root to parent1 and parent2 respectively (or the probabilities of the two parent terms getting full transmission from the root). The probability of reaching t via a full transmission graph from parent1 is then the product of p1 and 1/k1, which is p1/k1, and similarly p2/k2 for parent2. If the transmissions from parent1 and parent2 are independent, the probability of t (denoted as p_t) being transmitted from both parents is:

p_t = (p1/k1) * (p2/k2)

Since the two parents are the full set of t's parents, p_t is the probability of t getting full transmission from root. And the final information content is:

IC = -log(p_t)

Paper link: doi:10.1155/2012/975783 .
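
A minimal base R sketch of this recursion on a toy DAG is given below. It assumes the two-parent formula generalizes to a product over all parents of a term; the helper p_term() is hypothetical and not part of the package.

```r
# Toy DAG as parent -> child edges (same edges as the Examples section)
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
terms <- sort(unique(c(parents, children)))

# Hypothetical helper: probability of full transmission from the root.
# The root has p = 1; each parent q passes on p_q / (number of children of q).
p_term <- function(t) {
  ps <- parents[children == t]
  if (length(ps) == 0) return(1)
  prod(sapply(ps, function(q) p_term(q) / sum(parents == q)))
}

p <- sapply(terms, p_term)
ic_universal <- -log(p)
print(round(ic_universal, 3))
```

Term c has the parents a (p = 1, two children) and b (p = 0.5, two children), so p_c = (1/2) * (0.5/2) = 0.125.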

IC_Zhang_2006

It measures the number of ways from a term to reach leaf terms. E.g. in the following DAG:

     a      upstream
    /|\
   b | c
     |/
     d      downstream

Term a has three ways to reach a leaf: a->b, a->d and a->c->d.

Denote k as the number of ways for term t to reach leaves and N as the maximal value of k, which is associated with the root term. The information content is calculated as:

IC = -log(k/N) 
   = log(N) - log(k)

Paper link: doi:10.1186/1471-2105-7-135 .
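
The path counting can be sketched in base R for the four-term DAG drawn above. The sketch assumes a leaf counts as one way to reach itself, which is consistent with the three ways listed for term a; the helper n_ways() is hypothetical.

```r
# The four-term DAG from the diagram: a->b, a->c, a->d, c->d
parents  <- c("a", "a", "a", "c")
children <- c("b", "c", "d", "d")
terms <- sort(unique(c(parents, children)))

# Hypothetical helper: number of distinct paths from a term down to any leaf
n_ways <- function(t) {
  ch <- children[parents == t]
  if (length(ch) == 0) 1 else sum(sapply(ch, n_ways))
}

k <- sapply(terms, n_ways)
N <- max(k)                 # reached at the root
ic_zhang <- log(N) - log(k)
print(k)
```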

IC_Seco_2004

It is based on the number of offspring terms of term t. The information content is calculated as:

IC = 1 - log(k+1)/log(N)

where k is the number of offspring terms of t (so k+1 is the number of t's offspring terms plus t itself) and N is the total number of terms in the DAG.

Paper link: doi:10.5555/3000001.3000272 .
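
As a sketch of the formula under the definitions above, the following base R code evaluates it on a toy DAG. The helper offspring() is hypothetical and the code is not the package's implementation.

```r
# Toy DAG as parent -> child edges (same edges as the Examples section)
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
terms <- sort(unique(c(parents, children)))

# Hypothetical helper: all offspring of a term
offspring <- function(t) {
  kids <- children[parents == t]
  unique(c(kids, unlist(lapply(kids, offspring))))
}

k <- sapply(terms, function(t) length(offspring(t)))  # offspring only
N <- length(terms)                                    # all terms in the DAG
ic_seco <- 1 - log(k + 1) / log(N)
print(round(ic_seco, 3))
```

With this definition the values lie between 0 (the root, where k + 1 = N) and 1 (leaf terms, where k = 0).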

IC_Zhou_2008

It is a correction of IC_Seco_2004 which considers the depth of a term in the DAG. The information content is calculated as:

IC = 0.5*IC_Seco + 0.5*log(depth)/log(max_depth)

where depth is the depth of term t in the DAG, defined as the maximal distance from the root, and max_depth is the largest depth in the DAG. So IC is composed of two parts: the number of offspring terms and the position in the DAG.

Paper link: doi:10.1109/FGCNS.2008.16 .

IC_Seddiqui_2010

It is also a correction to IC_Seco_2004, but considers the number of relations connecting a term (i.e. the number of parent terms and child terms). The information content is defined as:

IC = (1-sigma)*IC_Seco + sigma*log(n_parents + n_children + 1)/log(total_edges + 1)

where n_parents and n_children are the numbers of parents and children of term t. The tuning factor sigma is defined as

sigma = log(total_edges+1)/(log(total_edges) + log(total_terms))

where total_edges is the number of all relations (all parent-child relations) and total_terms is the number of all terms in the DAG.

Paper link: doi:10.5555/1862330.1862343 .

IC_Sanchez_2011

It measures the average contribution of term t on leaf terms. First denote zeta as the number of leaf terms that can be reached from term t (i.e. t's offspring that are leaves). Since all t's ancestors can also reach t's leaves, the contribution of t on leaf terms is scaled by n_ancestors, which is the number of t's ancestor terms. The final information content is normalized by the total number of leaves in the DAG, which is the maximal possible value of zeta. The complete definition of information content is:

IC = -log( (zeta/n_ancestors) / n_all_leaves)

Paper link: doi:10.1016/j.knosys.2010.10.001 .

IC_Meng_2012

It has a complex form which takes into account the depth of the term and its downstream terms. The first factor is calculated as:

f1 = log(depth)/log(max_depth)

The second factor is calculated as:

f2 = 1 - log(1 + sum_{x in t's offspring}(1/depth_x))/log(total_terms)

In the equation, the summation goes over t's offspring terms.

The final information content is the multiplication of f1 and f2:

IC = f1 * f2

Paper link: http://article.nadiapub.com/IJGDC/vol5_no3/6.pdf.

There is one parameter, correct. If it is set to TRUE, the first factor f1 is calculated as:

f1 = log(depth + 1)/log(max_depth + 1)

correct can be set as:

term_IC(dag, method = "IC_Meng_2012", control = list(correct = TRUE))
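
To make the two factors concrete, the following base R sketch evaluates the corrected form (depth + 1) on a toy DAG. It assumes the root has depth 0, in which case the uncorrected f1 would involve log(0) for the root. The helpers depth() and offspring() are hypothetical and the code is not the package's implementation.

```r
# Toy DAG as parent -> child edges (same edges as the Examples section)
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
terms <- sort(unique(c(parents, children)))

# Hypothetical helpers: longest distance from the root, and all offspring
depth <- function(t) {
  p <- parents[children == t]
  if (length(p) == 0) 0 else 1 + max(sapply(p, depth))
}
offspring <- function(t) {
  kids <- children[parents == t]
  unique(c(kids, unlist(lapply(kids, offspring))))
}

d <- sapply(terms, depth)
max_depth <- max(d)
total_terms <- length(terms)

# f1 with the correction (depth + 1), so the root gives log(1) = 0
f1 <- log(d + 1) / log(max_depth + 1)

# f2 sums 1/depth over the offspring of each term (empty sum = 0 for leaves)
f2 <- sapply(terms, function(t) {
  off <- offspring(t)
  1 - log(1 + sum(1 / d[off])) / log(total_terms)
})

ic_meng <- f1 * f2
print(round(ic_meng, 3))
```

In this toy DAG the root gets IC = 0 and the deepest leaves get IC = 1.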

IC_Wang_2007

Each relation is weighted by a value less than 1 based on the semantic relation, e.g. 0.8 for "is_a" and 0.6 for "part_of". For a term t and one of its ancestor terms a, it first calculates an "S-value" which corresponds to the path from a to t along which the accumulated product of weights is maximal:

S(a->t) = max_{path}(prod_{edge on the path}(w))

Here max goes over all possible paths from a to t, and prod() multiplies edge weights in a certain path.

The formula can be transformed as follows (writing S for S(a->t)):

1/S = min(prod(1/w))
log(1/S) = log(min(prod(1/w)))
         = min(sum(log(1/w)))

Since w < 1, log(1/w) is positive. According to the equation, the path (a->...->t) is actually the shortest path from a to t by taking log(1/w) as the weight, and log(1/S) is the weighted shortest distance.

If S(a->t) is thought of as the maximal semantic contribution from a to t, the information content is calculated as the sum over all t's ancestors (including t itself):

IC = sum_{a in t's ancestors + t}(S(a->t))

Paper link: doi:10.1093/bioinformatics/btm087 .
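
The S-value and the resulting sum can be sketched in base R as below. The edge weights are hypothetical (0.8 and 0.6 are assigned to edges arbitrarily for illustration), S(t->t) is taken as 1, and unreachable pairs contribute 0 so that summing over all terms equals summing over ancestors plus t. The helper s_value() is not part of the package.

```r
# Toy DAG with one hypothetical weight per edge
parents  <- c("a", "a", "b", "b", "c", "d")
children <- c("b", "c", "c", "d", "e", "f")
w        <- c(0.8, 0.6, 0.8, 0.8, 0.8, 0.6)
terms <- sort(unique(c(parents, children)))

# Hypothetical helper: maximal product of edge weights over all paths a -> t.
# Returns 1 when a == t and 0 when t cannot be reached from a.
s_value <- function(a, t) {
  if (a == t) return(1)
  idx <- which(parents == a)
  if (length(idx) == 0) return(0)
  max(sapply(idx, function(i) w[i] * s_value(children[i], t)))
}

# IC of t: sum of S(a->t) over t's ancestors and t itself
ic_wang <- sapply(terms, function(t) {
  sum(sapply(terms, function(a) s_value(a, t)))
})
print(round(ic_wang, 3))

# Same S-value via the shortest-path view: exp(-min(sum(log(1/w))))
s_a_c <- exp(-min(log(1 / 0.8) + log(1 / 0.8), log(1 / 0.6)))
```

For term c, the two-step path a->b->c (0.8 * 0.8 = 0.64) beats the direct edge a->c (0.6), so S(a->c) = 0.64 and the IC of c is 1 + 0.8 + 0.64 = 2.44.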

The contribution of different semantic relations can be set with the contribution_factor parameter. The value should be a named numeric vector whose names cover the relations defined in the relations argument of create_ontology_DAG(). For example, if there are two relations "relation_a" and "relation_b" set in the DAG, the value for contribution_factor can be set as:

term_IC(dag, method = "IC_Wang_2007", 
    control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))

Note the IC_Wang_2007 method is normally used within the Sim_Wang_2007 semantic similarity method.

Examples

parents  = c("a", "a", "b", "b", "c", "d")
children = c("b", "c", "c", "d", "e", "f")
annotation = list(
    "a" = c("t1", "t2", "t3"),
    "b" = c("t3", "t4"),
    "c" = "t5",
    "d" = "t7",
    "e" = c("t4", "t5", "t6", "t7"),
    "f" = "t8"
)
dag = create_ontology_DAG(parents, children, annotation = annotation)
term_IC(dag, "IC_annotation")
#> IC_method: IC_annotation
#>         a         b         c         d         e         f 
#> 0.0000000 0.2876821 0.6931472 1.3862944 0.6931472 2.0794415