vignettes/topic1_06_UniProtKeywords.Rmd
The UniProt database provides a controlled vocabulary of keywords for genes and proteins (https://www.uniprot.org/keywords/). These keywords are useful for summarizing gene functions in a compact way.
First load the package:
library(UniProtKeywords)
The release and source information of the data:
UniProtKeywords
## UniProt Keywords
## Release: 2023_01 of 22-Feb-2023
## Source: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/keywlist.txt
## Number of keywords: 1201
## Built date: 2023-03-22
The UniProtKeywords package provides precompiled keyword gene sets for a number of organisms, which can be retrieved with the function load_keyword_genesets(). The argument is the taxon ID of an organism. The full list of supported organisms can be found in the documentation of load_keyword_genesets() (or in the object UniProtKeywords:::ORGANISM).
gl = load_keyword_genesets(9606)
gl[3:4] # because gl[1:2] has a very long output, here we print gl[3:4]
## $`3Fe-4S`
## [1] "6390"
##
## $`4Fe-4S`
## [1] "6059" "48" "50" "54901" "64428" "51654" "57019" "1663"
## [9] "1763" "5424" "5426" "1806" "55140" "2068" "2110" "64789"
## [17] "83990" "3658" "11019" "4337" "4595" "4719" "4720" "374291"
## [25] "4728" "4723" "4682" "10101" "5558" "5471" "5980" "55316"
## [33] "91543" "51750" "6390" "441250" "55253"
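Because the result is a named list of gene ID vectors, lookups in both directions need only base R. The following is a minimal sketch on a toy list, `gl_toy`, whose entries are copied from a subset of the output above; the object name and the assumption that gene IDs are stored as character strings are illustrative, not part of the package.

```r
# Toy gene sets (a subset of the output shown above); gl_toy is a
# hypothetical stand-in for the list returned by load_keyword_genesets()
gl_toy <- list(
    "3Fe-4S" = c("6390"),
    "4Fe-4S" = c("6059", "48", "50", "6390")
)

# Genes annotated with a given keyword
genes_4fe4s <- gl_toy[["4Fe-4S"]]

# Keywords annotated to a given gene: test membership in every gene set
gene <- "6390"
is_member <- vapply(gl_toy, function(genes) gene %in% genes, logical(1))
kw <- names(gl_toy)[is_member]
kw
```

The same two expressions should work on the full `gl` object, provided the gene ID is passed in the same type as it is stored in the list.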
You can also use the name of the organism:
load_keyword_genesets("human")
load_keyword_genesets("Homo sapiens")
If the argument as_table is set to TRUE, load_keyword_genesets() returns a two-column data frame.
tb = load_keyword_genesets(9606, as_table = TRUE)
head(tb)
## keyword gene
## 1 2Fe-2S 2230
## 2 2Fe-2S 150209
## 3 2Fe-2S 316
## 4 2Fe-2S 55847
## 5 2Fe-2S 493856
## 6 2Fe-2S 284106
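The list form and the table form carry the same information, so one can be derived from the other. The sketch below shows the conversion in both directions on a small toy table, `tb_toy`, which is a hypothetical stand-in with the same two column names as the output above; storing gene IDs as character strings is an assumption.

```r
# Toy table with the same columns as the output above (hypothetical data frame)
tb_toy <- data.frame(
    keyword = c("2Fe-2S", "2Fe-2S", "3Fe-4S"),
    gene = c("2230", "150209", "6390"),
    stringsAsFactors = FALSE
)

# Table -> named list: one vector of genes per keyword
gl_from_tb <- split(tb_toy$gene, tb_toy$keyword)

# Named list -> two-column table: repeat each keyword once per gene
tb_from_gl <- data.frame(
    keyword = rep(names(gl_from_tb), times = lengths(gl_from_tb)),
    gene = unlist(gl_from_tb, use.names = FALSE),
    stringsAsFactors = FALSE
)
```

Note that `split()` orders the list by sorted keyword name, so a round trip preserves row order only when the table is already sorted by keyword.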
We can check some simple statistics: the distribution of gene set sizes and the number of words per keyword.
plot(table(sapply(gl, length)), log = "x",
    xlab = "Size of keyword genesets",
    ylab = "Number of keywords"
)
# Number of words = number of pieces after splitting on space, "-" or "/"
plot(table(lengths(strsplit(names(gl), " |-|/"))),
    xlab = "Number of words in keywords",
    ylab = "Number of keywords"
)
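To see how word counting behaves on individual names, here is a small sketch on three keyword names. `4Fe-4S` appears in the output above, while `Kinase` and `Cell cycle` are chosen here purely for illustration. It compares splitting on the separators with counting `gregexpr()` matches. The latter counts separators rather than words, and because `gregexpr()` returns `-1` when nothing matches, a name without any separator still has length 1.

```r
# Illustrative keyword names (only "4Fe-4S" is taken from the output above)
kw_names <- c("Kinase", "4Fe-4S", "Cell cycle")

# Words per name: split on space, "-" or "/" and count the pieces
n_words <- lengths(strsplit(kw_names, " |-|/"))

# Separator matches per name; a name with no match yields a single -1,
# so its length is also 1
n_matches <- sapply(gregexpr(" |-|/", kw_names), length)

data.frame(keyword = kw_names, n_words = n_words, n_matches = n_matches)
```

With match counts, one-word and two-word names fall into the same bin, which is why the sketch counts split pieces instead.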