vignettes/v3_import.Rmd
There are several formats for ontology data. The most compact and readable format is the .obo format, which was initially developed by the GO consortium. Many ontologies in .obo format are available from the OBO Foundry or BioPortal. A description of the .obo format is available at https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html.
In the simona package, the function import_obo() can be used to import an .obo file into an ontology_DAG object. The input is a path on the local computer or a URL. The following example uses the Plant Ontology.
The link to po.obo can be found on the Plant Ontology web page. You can download the file or directly provide its URL.
library(simona)
dag1 = import_obo("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.obo")
dag1
## An ontology_DAG object:
## Source: po, releases/2023-07-13
## 1656 terms / 1776 relations
## Root: ~~all~~
## Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
## Max depth: 11
## Avg number of parents: 1.07
## Avg number of children: 1.06
## Aspect ratio: 39:1 (based on the longest distance from root)
## 38.2:1 (based on the shortest distance from root)
## Relations: is_a
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
There are also several meta columns attached to the object, such as the name and the long definition of terms in the ontology.
head(mcols(dag1))
## id short_id name namespace
## PO:0000001 PO:0000001 PO:0000001 plant embryo proper plant_anatomy
## PO:0000002 PO:0000002 PO:0000002 anther wall plant_anatomy
## PO:0000003 PO:0000003 PO:0000003 whole plant plant_anatomy
## PO:0000004 PO:0000004 PO:0000004 in vitro plant structure plant_anatomy
## PO:0000005 PO:0000005 PO:0000005 cultured plant cell plant_anatomy
## PO:0000006 PO:0000006 PO:0000006 plant protoplast plant_anatomy
## definition
## PO:0000001 An embryonic plant structure (PO:0025099) that is the body of a developing plant embryo (PO:0009009) attached to the maternal tissue in an plant ovule (PO:0020003) by a suspensor (PO:0020108).
## PO:0000002 A microsporangium wall (PO:0025307) that is part of an anther (PO:0009066).
## PO:0000003 A plant structure (PO:0005679) which is a whole organism.
## PO:0000004 A plant structure (PO:0009011) that is grown or maintained in vitro.
## PO:0000005 A plant cell (PO:0009002) that is grown or maintained in vitro.
## PO:0000006 A cultured plant cell from which the entire plant cell wall has been removed.
Note that the rows in mcols(dag1) correspond to the terms in dag_all_terms(dag1).
The is_a relation between classes is, of course, saved in the DAG object (it is specified in the is_a tag in the .obo file). Additional relation types can also be selected (they are specified in the relationship tag). By default, only the relation type part_of is used. You can check the other values associated with the relationship tag and the [Typedef] section in the .obo file to select proper additional relation types. Just make sure that the selected relation types are transitive and not inverted (e.g. you cannot select has_part, which is the inverse relation of part_of).
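As a rough way to see which relation types an .obo file offers before choosing relation_type, the relationship tags can be tallied with base R. This is a minimal sketch and not part of simona: count_obo_relations() is a hypothetical helper, and the small made-up fragment stands in for the lines of a real file.

```r
# Hypothetical helper (not part of simona): tally the relation types that
# appear in "relationship:" tags of an .obo file.
count_obo_relations = function(lines) {
    # keep only lines such as "relationship: part_of X:0000003"
    rel = grep("^relationship:[[:space:]]*", lines, value = TRUE)
    # the relation type is the first token after the tag name
    type = sub("^relationship:[[:space:]]*([^[:space:]]+).*$", "\\1", rel)
    table(type)
}

# A tiny made-up fragment; for a real file use readLines("po.obo") instead
obo_lines = c(
    "[Term]",
    "id: X:0000002",
    "is_a: X:0000001",
    "relationship: part_of X:0000003",
    "",
    "[Term]",
    "id: X:0000004",
    "relationship: part_of X:0000003",
    "relationship: has_part X:0000005"
)
rel_counts = count_obo_relations(obo_lines)
rel_counts
```

The tally only shows which types occur. Whether a type is transitive still has to be checked in the [Typedef] section of the file.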
Relations can also have a DAG structure. In import_obo(), if a parent relation type is selected, all its offspring types are automatically selected. For example, in GO, besides the relations is_a and part_of, there are also regulates, positively_regulates and negatively_regulates, where the latter two are child relations of regulates. So if regulates is selected as an additional relation type, the other two are automatically selected. The DAG of relation types is automatically recognized and saved from the ontology files.
import_obo("file_for_go.obo", relation_type = c("part_of", "regulates"))
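The automatic selection of offspring relation types can be pictured with a small base R sketch. It is an illustration under stated assumptions, not simona's internal code: expand_relation_types() is a hypothetical helper, and the child-to-parent map encodes only the GO hierarchy described above.

```r
# Hypothetical helper (not simona's internal code): given child -> parent
# links between relation types, return the selected types plus all offspring.
expand_relation_types = function(selected, parent_of) {
    result = selected
    repeat {
        # relation types whose parent is already in the result set
        children = names(parent_of)[parent_of %in% result]
        new = setdiff(children, result)
        if (length(new) == 0) break
        result = c(result, new)
    }
    result
}

# Relation hierarchy as described for GO: names are children, values parents
parent_of = c(
    positively_regulates = "regulates",
    negatively_regulates = "regulates"
)
expanded = expand_relation_types(c("part_of", "regulates"), parent_of)
expanded
```

The loop repeats until no new types are added, so deeper hierarchies (children of children) would be picked up as well.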
Finally, all spaces in relation_type are converted to underscores, so specifying "part of" is the same as specifying "part_of".
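The effect of this normalization can be reproduced in base R. The snippet is only an equivalent illustration of the behaviour described above, not the code used inside simona.

```r
# Two spellings of the same relation type, plus one more with a space
relation_type = c("part of", "part_of", "positively regulates")

# Replace every space with an underscore, then drop duplicates
normalized = unique(gsub(" ", "_", relation_type, fixed = TRUE))
normalized
```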
For ontologies in other formats, simona uses the external tool ROBOT to convert them to the .obo format and then internally uses import_obo() to import them. ROBOT is a mature tool for converting between ontology formats. The file robot.jar is required and can be downloaded from https://github.com/ontodev/robot/releases. Since ROBOT is a Java tool, Java must already be available on your machine.
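Before calling the ROBOT-based importer, it can help to confirm that both prerequisites are in place. The following is a minimal sketch in base R: check_robot_setup() is a hypothetical helper that is not part of simona, and the path to robot.jar is only an example location.

```r
# Hypothetical helper (not part of simona): report whether ROBOT's
# prerequisites appear to be in place.
check_robot_setup = function(robot_jar) {
    c(
        # Sys.which() returns "" when the executable is not on the PATH
        java_found = nzchar(Sys.which("java")[[1]]),
        # path.expand() resolves a leading "~" to the home directory
        jar_found = file.exists(path.expand(robot_jar))
    )
}

status = check_robot_setup("~/Downloads/robot.jar")
status
```

A FALSE for java_found only means that no java executable is on the PATH of the R session. Java may still be installed elsewhere.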
The file po.owl can also be found on the Plant Ontology web page.
dag2 = import_ontology("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl",
    robot_jar = "~/Downloads/robot.jar")
dag2
## An ontology_DAG object:
## Source: po, releases/2023-07-13
## 1656 terms / 1776 relations
## Root: ~~all~~
## Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
## Max depth: 11
## Avg number of parents: 1.07
## Avg number of children: 1.06
## Aspect ratio: 39:1 (based on the longest distance from root)
## 38.2:1 (based on the shortest distance from root)
## Relations: is_a
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
More conveniently, the path to robot.jar can be set as a global option:
simona_opt$robot_jar = "~/Downloads/robot.jar"
import_ontology("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl")
ROBOT supports the following ontology formats, which are automatically identified from the file contents.
json: OBO Graphs JSON
obo: OBO Format
ofn: OWL Functional
omn: Manchester
owl: RDF/XML
owx: OWL/XML
ttl: Turtle
For some huge ontologies, ROBOT requires a large amount of memory to convert them to the .obo format. If the ontology is in the .owl format (in the RDF/XML serialization format), the function import_owl() can be used as an alternative. import_owl() directly parses the .owl file and returns an ontology_DAG object. import_owl() is written from scratch, and it is recommended only when import_ontology() does not work.
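One way to follow this recommendation is to try the ROBOT-based importer first and fall back only on failure. The sketch below assumes that a failed conversion surfaces as an R error, which the text above does not confirm. import_with_fallback() is a hypothetical helper, and the two stand-in functions take the place of import_ontology() and import_owl() so that the example runs without simona or ROBOT.

```r
# Hypothetical helper (not part of simona): call `primary`, and call
# `fallback` with the same input only if `primary` signals an error.
import_with_fallback = function(path, primary, fallback) {
    tryCatch(
        primary(path),
        error = function(e) {
            message("primary importer failed: ", conditionMessage(e))
            fallback(path)
        }
    )
}

# Stand-ins so the sketch runs on its own; with simona installed these would
# be import_ontology and import_owl.
failing_importer = function(path) stop("conversion ran out of memory")
simple_importer = function(path) paste("parsed", path)

result = import_with_fallback("po.owl", failing_importer, simple_importer)
result
```

Passing the importers as arguments keeps the helper independent of any package, so the fallback order is explicit at the call site.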
dag3 = import_owl("https://raw.githubusercontent.com/Planteome/plant-ontology/master/po.owl")
dag3
## An ontology_DAG object:
## Source: Plant Ontology, http://purl.obolibrary.org/obo/po/releases/2023-07-13/po.owl
## 1656 terms / 1776 relations
## Root: ~~all~~
## Terms: PO:0000001, PO:0000002, PO:0000003, PO:0000004, ...
## Max depth: 11
## Avg number of parents: 1.07
## Avg number of children: 1.06
## Aspect ratio: 39:1 (based on the longest distance from root)
## 38.2:1 (based on the shortest distance from root)
## Relations: is_a
##
## With the following columns in the metadata data frame:
## id, short_id, name, namespace, definition
Similarly, some ontologies may only provide large .ttl files (the Turtle format). simona also provides the function import_ttl(), which can read .ttl files that use owl:Class as objects. The internal parsing script is written in Perl, so you need to make sure Perl is installed on your machine.
# https://bioportal.bioontology.org/ontologies/MSTDE
dag4 = import_ttl("https://jokergoo.github.io/simona/MSTDE.ttl")
dag4
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] simona_1.1.3 knitr_1.44
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 xml2_1.3.5 shape_1.4.6
## [4] stringi_1.7.12 digest_0.6.33 magrittr_2.0.3
## [7] evaluate_0.22 grid_4.3.1 RColorBrewer_1.1-3
## [10] iterators_1.0.14 circlize_0.4.15 fastmap_1.1.1
## [13] foreach_1.5.2 doParallel_1.0.17 rprojroot_2.0.3
## [16] jsonlite_1.8.7 GlobalOptions_0.1.2 promises_1.2.1
## [19] ComplexHeatmap_2.16.0 purrr_1.0.2 codetools_0.2-19
## [22] textshaping_0.3.7 jquerylib_0.1.4 shiny_1.6.0
## [25] cli_3.6.1 rlang_1.1.1 crayon_1.5.2
## [28] scatterplot3d_0.3-44 ellipsis_0.3.2 cachem_1.0.8
## [31] yaml_2.3.7 tools_4.3.1 parallel_4.3.1
## [34] memoise_2.0.1 colorspace_2.1-0 httpuv_1.6.11
## [37] GetoptLong_1.0.5 BiocGenerics_0.46.0 curl_5.1.0
## [40] mime_0.12 vctrs_0.6.4 R6_2.5.1
## [43] png_0.1-8 matrixStats_1.0.0 stats4_4.3.1
## [46] lifecycle_1.0.3 stringr_1.5.0 S4Vectors_0.38.2
## [49] fs_1.6.3 IRanges_2.34.1 clue_0.3-65
## [52] cluster_2.1.4 ragg_1.2.6 pkgconfig_2.0.3
## [55] desc_1.4.2 later_1.3.1 pkgdown_2.0.7
## [58] bslib_0.5.1 Rcpp_1.0.11 glue_1.6.2
## [61] systemfonts_1.0.5 xfun_0.40 xtable_1.8-4
## [64] rjson_0.2.21 igraph_1.5.1 htmltools_0.5.6.1
## [67] rmarkdown_2.25 Polychrome_1.5.1 compiler_4.3.1