The UniProt database provides a controlled vocabulary of keywords for annotating genes and proteins (https://www.uniprot.org/keywords/). These keywords are useful for summarizing gene functions in a compact way.

The UniProtKeywords package

First load the package:

library(UniProtKeywords)

The release and source information of the data:

UniProtKeywords
## UniProt Keywords
##   Release: 2023_01 of 22-Feb-2023 
##   Source: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/keywlist.txt 
##   Number of keywords: 1201 
##   Built date: 2023-03-22

The UniProtKeywords package provides precompiled keyword gene sets for a number of organisms, which can be retrieved with the function load_keyword_genesets(). Its argument is the taxon ID of an organism. The full list of supported organisms can be found in the documentation of load_keyword_genesets() (or in the internal object UniProtKeywords:::ORGANISM).

gl = load_keyword_genesets(9606)
gl[3:4]  # gl[1:2] produces very long output, so we print gl[3:4] instead
## $`3Fe-4S`
## [1] "6390"
## 
## $`4Fe-4S`
##  [1] "6059"   "48"     "50"     "54901"  "64428"  "51654"  "57019"  "1663"  
##  [9] "1763"   "5424"   "5426"   "1806"   "55140"  "2068"   "2110"   "64789" 
## [17] "83990"  "3658"   "11019"  "4337"   "4595"   "4719"   "4720"   "374291"
## [25] "4728"   "4723"   "4682"   "10101"  "5558"   "5471"   "5980"   "55316" 
## [33] "91543"  "51750"  "6390"   "441250" "55253"
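Because the result is a plain named list (keyword names mapped to character vectors of gene IDs), it can be inverted with base R to look up all keywords annotated to a given gene. The following is a minimal sketch on a toy list that mimics the structure printed above; `gl_toy` and `gene2kw` are illustrative names, not objects provided by the package.

```r
# Toy stand-in for the list returned by load_keyword_genesets(9606):
# names are keywords, values are character vectors of gene IDs.
gl_toy = list(
    "3Fe-4S" = c("6390"),
    "4Fe-4S" = c("6059", "48", "50", "6390")
)

# Invert the mapping: for each gene ID, collect the keywords it belongs to.
gene2kw = split(
    rep(names(gl_toy), lengths(gl_toy)),    # keyword repeated once per member gene
    unlist(gl_toy, use.names = FALSE)       # the gene IDs, in the same order
)

gene2kw[["6390"]]
```

With the real data, replacing `gl_toy` by `gl` should give the same kind of lookup for every gene in the human keyword gene sets.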

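You can also pass the name of the organism instead of its taxon ID; the supported organisms are listed in the documentation of load_keyword_genesets().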

If the argument as_table is set to TRUE, load_keyword_genesets() returns a two-column data frame instead of a list.

tb = load_keyword_genesets(9606, as_table = TRUE)
head(tb)
##   keyword   gene
## 1  2Fe-2S   2230
## 2  2Fe-2S 150209
## 3  2Fe-2S    316
## 4  2Fe-2S  55847
## 5  2Fe-2S 493856
## 6  2Fe-2S 284106
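The list form and the table form carry the same information, so one can be converted into the other with base R. The sketch below uses a small toy table shaped like the output above; `tb_toy` is an illustrative object, and storing the gene IDs as character strings is an assumption made here for the example.

```r
# Toy stand-in for load_keyword_genesets(9606, as_table = TRUE).
# Assumption: gene IDs are stored as character strings.
tb_toy = data.frame(
    keyword = c("2Fe-2S", "2Fe-2S", "2Fe-2S", "3Fe-4S"),
    gene    = c("2230", "150209", "316", "6390"),
    stringsAsFactors = FALSE
)

# Table -> list: one character vector of genes per keyword.
gl_from_tb = split(tb_toy$gene, tb_toy$keyword)

# List -> table: repeat each keyword once per member gene.
tb_from_gl = data.frame(
    keyword = rep(names(gl_from_tb), lengths(gl_from_tb)),
    gene    = unlist(gl_from_tb, use.names = FALSE),
    stringsAsFactors = FALSE
)

gl_from_tb
tb_from_gl
```

The table form is convenient for joins and filtering, while the list form is what most enrichment-style functions expect as gene set input.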

Statistics

Below are some simple statistics of the human keyword gene sets loaded above.

  1. Sizes of keyword genesets:
plot(table(sapply(gl, length)), log = "x", 
    xlab = "Size of keyword genesets",
    ylab = "Number of keywords"
)

  2. Numbers of words in keywords:
# split on space, hyphen or slash; a keyword without a separator counts as one word
plot(table(lengths(strsplit(names(gl), " |-|/"))), 
    xlab = "Number of words in keywords",
    ylab = "Number of keywords"
)

  3. Numbers of characters in keywords:
plot(table(nchar(names(gl))), 
    xlab = "Number of characters in keywords",
    ylab = "Number of keywords"
)
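The quantities behind these plots can also be inspected as plain numbers. The sketch below computes word and character counts for a handful of keyword names chosen only for illustration (`kw` is not an object from the package). Words are counted by splitting on spaces, hyphens and slashes, so a keyword without any separator counts as one word.

```r
# A few keyword names used only for illustration.
kw = c("3Fe-4S", "Kinase", "Cell cycle", "DNA-binding")

# Number of words: split each name on space, hyphen or slash and count the pieces.
n_words = lengths(strsplit(kw, " |-|/"))

# Number of characters in each name.
n_chars = nchar(kw)

data.frame(keyword = kw, n_words = n_words, n_chars = n_chars)

# Frequency tables like the ones passed to plot() above.
table(n_words)
table(n_chars)
```

Applying the same calls to `names(gl)` should reproduce the counts shown in the word and character plots.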