vignettes/v04_information_content.Rmd
v04_information_content.Rmd
The methods of information content implemented in simona are mainly from the supplementary file of the paper “Mazandu et al., Gene Ontology semantic similarity tools: survey on features and challenges for biological knowledge discovery. Briefings in Bioinformatics 2017”. Original denotations have been slightly modified to make them more consistent. Also more explanations have been added in this vignette.
The following denotations will be used throughout the vignette. The denotations are mainly from Mazandu 2017 only with small modifications.
Denotation | Description |
---|---|
The root term of the DAG. In simona there is always one root term. | |
The depth of a term in the DAG, which is the longest distance from root . | |
The length of the longest path from root to a term via term . | |
The maximal depth in the DAG. | |
The height of term in the DAG, which is the longest finite distance to leaf terms. | |
The set of child terms of term . | |
The set of parent terms of term . | |
The set of ancestor terms of term . | |
The set of ancestor terms of term , including itself. | |
The set of offspring terms of term . | |
The set of offspring terms of term , including itself. | |
The set of leaf terms that term can reach. | |
Number of elements in set . | |
The shortest distance bewteen and . | |
The longest distance bewteen and . | |
The length of the longest path from and via . | |
The set of common ancestors of term and , i.e. . | |
Lowest common ancestor of and , which is the common ancestor with the largest depth in DAG, i.e. There might be more than one LCA terms for given two terms, to simplify the calculation, the one with the longest distance (the default) to and is used. | |
Nearest common ancestor of and , i.e. If there are more than one NCA terms, the one with the largest depth (the lowest one) is used. | |
Most informative common ancestor of and , i.e. There might be more than one MICA terms for given two terms, the one with the longest distance (the default) to and is used. | |
The set of annotated items on term . |
Assume term is an ancestor of term , (the order of and does not matter) is the normal shortest distance from to in a directed graph. The definition is similar for .
If term and are not in offspring/ancestor relationship, i.e. is not an ancestor of , and is not an ancestor of , then
For many information content (IC) definitions, it has the following form:
where is the probability of term under a certain measure in the DAG. If a term has a smaller probability, it means it is more specific, the form will have a larger value, thus it is called the “information content”. Basically the more to the leaves in the DAG, a smaller p-value is gained and the higher information content a term has.
IC can be generalized not necessarily to be determined by p-values, but to a metric which is non-negative and a larger value of a term means more informative/important/deep in the DAG. One valid definition of such generalized IC is the term depth where with a larger value of the depth means more specific a term is in the DAG.
In the following sections, I will use Gene Ontology as an example to demonstrate various IC methods supported in the simona package. Note here I also set the gene annotation because some IC methods depend on the number of annotated genes on GO terms.
library(simona)
dag = create_ontology_DAG_from_GO_db(namespace = "BP", org_db = "org.Hs.eg.db")
dag
## An ontology_DAG object:
## Source: GO BP / GO.db package 3.19.1
## 27186 terms / 54178 relations
## Root: GO:0008150
## Terms: GO:0000001, GO:0000002, GO:0000003, GO:0000011, ...
## Max depth: 18
## Avg number of parents: 1.99
## Avg number of children: 1.87
## Aspect ratio: 356.46:1 (based on the longest distance from root)
## 756.89:1 (based on the shortest distance from root)
## Relations: is_a, part_of
## Annotations: 18888 items
## 291, 1890, 4205, 4358, ...
##
## With the following columns in the metadata data frame:
## id, name, definition
The wrapper function term_IC()
calculates information contents of terms in the DAG with a specific method. Note the method name can be partially matched. control
argument controls parameters for specific methods.
term_IC(dag, method = ..., control = list(...))
All supported term IC methods are:
## [1] "IC_offspring" "IC_height" "IC_annotation" "IC_universal"
## [5] "IC_Zhang_2006" "IC_Seco_2004" "IC_Zhou_2008" "IC_Seddiqui_2010"
## [9] "IC_Sanchez_2011" "IC_Meng_2012" "IC_Wang_2007"
The information content is calculated as:
In the bracket is the fraction of number of offspring terms to the total number of terms in the DAG.
term_IC(dag, method = "IC_offspring")
It measures the relative position of a term in the DAG:
where is the height of a term and is the depth of the term. 1 is added in the formula to get rid of for leaf terms.
term_IC(dag, method = "IC_height")
Due to the nature of DAG, if a term is annotated to an item, all its ancestor terms are also associated with that item. The calculation of annotated items is applied in a recursive way.
Denote as the set of items directly annotated to , and as the set of items annotated to after merging from its child terms. is the union of all items annotated to its child terms and the items directly annotated to .
And the information content is:
where is the set of all items annotated to the DAG (i.e. to the root).
term_IC(dag, method = "IC_annotation")
NA
is assigned to the terms with no item annotated.
can also be written as:
In some implementation of current tools, number of annotated items on a term is calculated by simply adding such numbers from all its child terms, as shown in the following equation:
where is the number of items directly annotated to term . It should be noted that this way over-estimates the number of annotated items because an item can be annotated to multiple terms. For example, a gene can be annotated to multiple GO terms (you can check the object org.Hs.eg.db::org.Hs.egGO
. And even for the same term, it may have duplicated annotated genes with multiple evidence codes (e.g. the gene with Entrez ID "994"
is annotated to a GO term "GO:0000086"
twice with evidence code "IBA"
and "TAS"
).
Since annotated-based is the probability of a term being annotated when randomly selecting an item, the second way might not be proper. The two methods are identical only if an item is uniquely annotated to a term in the DAG.
We can compare the two different implementations with or without uniquifying annotations. We can specify an additional parameter for the IC_annotation method: control = list(uniquify = FALSE)
to use the second method.
ic1 = term_IC(dag, method = "IC_annotation")
ic2 = term_IC(dag, method = "IC_annotation", control = list(uniquify = FALSE))
# ranges on both x- and y-axes
rg = c(0, max(ic1, ic2, na.rm = TRUE))
plot(ic1, ic2, xlim = rg, ylim = rg,
xlab = "uniquified (first method)", ylab = "not uniquified (second method)",
pch = 16, col = "#00000020", main = "compare IC_annotation")
abline(a = 0, b = 1, col = "red")
It shows there is a systematic shift between the two methods.
IC_annotation is the only method which replies on external annotation data. This type of method has been criticized because it replies on changable external data and people argue that IC should be an intrinsic attribute and only depends on the DAG structure itself.
All other IC methods introduced in later sections only purely depend on the DAG structure itself.
It measures the probability of a term getting full transmission from the root. Each term is associated with a p-value and the root has the p-value of 1.
To understand this method, let’s think the DAG relates to a kind of information transmission from the root to the inner DAG, where information is 100% on root. Then, in a certain level in the DAG, a term has two parents and , also let’s assume has children and has children, next we assume a parent transmits information equally to all its children, then only transmits and only transmits of its content to term , on which we could also say, the probability of or to reach is or . Let’s assume and are the accumulated contents from the root to and already (or the probability of the two parent terms getting full transmission from root), then the probability of reaching via a full transmission graph from is the multiplication of and , which is , and it is similar for . Then, for term , if getting transmitted from and are independent, the probability of (denoted as ) to get transmitted from both parents is:
Since the two parents are the full set of ’s parents, is the probability of getting full transmission from root. And the final information content is:
term_IC(dag, method = "IC_universal")
The formal definition is:
Paper link: https://doi.org/10.1155/2012/975783.
It measures the number of ways of a term to reach leaf terms. E.g. in the following DAG:
a upstream
/|\
b | c
|/
d downstream
Term “a” has three ways to reach leaves, which are a->b
, a->d
and a->c->d
.
The number of ways of term to reach leaf terms can be obtained in a recursively way:
And the information content:
where is the number of ways to reach all leaf terms from root.
term_IC(dag, method = "IC_Zhang_2006")
Paper link: https://doi.org/10.1186/1471-2105-7-135.
It is based on the number of offspring terms (with itself) of term . The information content is calculated as:
where is the total number of terms in the DAG.
term_IC(dag, method = "IC_Seco_2004")
IC_Seco_2004 is actually a normalized version of IC_offspring:
where for leaf terms ().
Paper link: https://dl.acm.org/doi/10.5555/3000001.3000272.
It is a correction of IC_Seco_2004 which also considers the depth of a term in the DAG. The information content is calculated as:
And IC is set to 0 for root term.
term_IC(dag, method = "IC_Zhou_2008")
Paper link: https://doi.org/10.1109/FGCNS.2008.16.
It is also a correction of IC_Seco_2004 but considers the number of relations a term has in the DAG.
where is the number of relations of , i.e. number of its parent and child terms, is the total number of edges in the DAG. The tunning factor is defined as:
where is the total number of nodes in the DAG. When the DAG has a tree structure, i.e. a term only has one parent and root has no parent, there is the following relation:
In this case, , and
term_IC(dag, method = "IC_Seddiqui_2010")
Paper link: https://dl.acm.org/doi/10.5555/1862330.1862343.
It measures the average contribution of term on leaf terms. First denote as the number of leaf terms that can be reached from term . Since all ’s ancestors can also reach ’s leaves, the contribution of on leaf terms is scaled by the number of its ancestors. The final information content is normalized by the total number of leaves in the DAG.
where is the total number of leaves in the DAG.
term_IC(dag, method = "IC_Sanchez_2011")
Paper link: https://doi.org/10.1016/j.knosys.2010.10.001.
It has a complex form which takes account of the term depth and the downstream of the term. The first factor is calculated as:
The second factor is calculated as:
In the equation, the summation goes over ’s offspring terms.
The final information content is the multiplication of and :
IC for root is set to 0.
term_IC(dag, method = "IC_Meng_2012")
Paper link: http://article.nadiapub.com/IJGDC/vol5_no3/6.pdf.
Each child-parent relation is weighted by a value smaller than 1 based on the semantic relation type, i.e. 0.8 for “is_a” and 0.6 for “part_of”. For a term and one of its ancestor term , it first calculates an “S-value” which corresponds to a path from to where the accumulated multiplication of weights along the path reaches maximal:
To properly understand this setting, let’s first apply the following transformation:
And the definition from is converted to :
In this way, is actually the shortest distance from to taking as the edge weight. We can obtain back by:
The information content is the summation over all ’s ancestors:
Normally IC_Wang_2007 is only used for calculating the semantic similarity using Sim_Wang_2007 method.
The contribution of different semantic relations can be set with the contribution_factor
parameter. The value should be a named numeric vector where names should cover the relations defined in relations set in create_ontology_DAG()
. For example, if there are two relations “relation_a” and “relation_b” set in the DAG, the value for contribution_factor can be set as:
term_IC(dag, method = "IC_Wang_2007", control = list(contribution_factor = c("relation_a" = 0.8, "relation_b" = 0.6)))
By default 0.8 is set for “is_a” and 0.6 for “part_of”.
If you are not sure what types of relations have been set, simply type the dag
object. The relation types are all printed there.
Paper link: https://doi.org/10.1093/bioinformatics/btm087.
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] simona_1.3.12 knitr_1.48
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.44.1 circlize_0.4.16 shape_1.4.6.1
## [4] rjson_0.2.22 xfun_0.47 bslib_0.8.0
## [7] htmlwidgets_1.6.4 GlobalOptions_0.1.2 Biobase_2.64.0
## [10] vctrs_0.6.5 tools_4.4.1 stats4_4.4.1
## [13] parallel_4.4.1 Polychrome_1.5.1 AnnotationDbi_1.66.0
## [16] RSQLite_2.3.7 highr_0.11 cluster_2.1.6
## [19] blob_1.2.4 pkgconfig_2.0.3 RColorBrewer_1.1-3
## [22] desc_1.4.3 S4Vectors_0.42.1 scatterplot3d_0.3-44
## [25] GenomeInfoDbData_1.2.12 lifecycle_1.0.4 compiler_4.4.1
## [28] textshaping_0.4.0 Biostrings_2.72.1 codetools_0.2-20
## [31] ComplexHeatmap_2.20.0 clue_0.3-65 GenomeInfoDb_1.40.1
## [34] httpuv_1.6.15 htmltools_0.5.8.1 sass_0.4.9
## [37] yaml_2.3.10 later_1.3.2 pkgdown_2.1.0
## [40] crayon_1.5.3 jquerylib_0.1.4 GO.db_3.19.1
## [43] cachem_1.1.0 org.Hs.eg.db_3.19.1 iterators_1.0.14
## [46] foreach_1.5.2 mime_0.12 digest_0.6.37
## [49] fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
## [52] cli_3.6.3 magrittr_2.0.3 UCSC.utils_1.0.0
## [55] promises_1.3.0 bit64_4.0.5 XVector_0.44.0
## [58] rmarkdown_2.28 httr_1.4.7 matrixStats_1.3.0
## [61] igraph_2.0.3 bit_4.0.5 ragg_1.3.2
## [64] png_0.1-8 GetoptLong_1.0.5 memoise_2.0.1
## [67] shiny_1.9.1 evaluate_0.24.0 IRanges_2.38.1
## [70] doParallel_1.0.17 rlang_1.1.4 Rcpp_1.0.13
## [73] xtable_1.8-4 DBI_1.2.3 xml2_1.3.6
## [76] BiocGenerics_0.50.0 jsonlite_1.8.8 R6_2.5.1
## [79] zlibbioc_1.50.0 systemfonts_1.1.0 fs_1.6.4