The UniProt database provides a controlled vocabulary of keywords for annotating genes and proteins (https://www.uniprot.org/keywords/). These keywords are useful for summarizing gene functions in a compact way.

The UniProtKeywords package

First load the package:

library(UniProtKeywords)

The release and source information of the data:

UniProtKeywords
## UniProt Keywords
##   Release: 2023_04 of 13-Sep-2023 
##   Source: https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/keywlist.txt 
##   Number of keywords: 1201 
##   Built date: 2023-09-23

The UniProtKeywords package provides precompiled keyword gene sets for a number of organisms, which can be retrieved with the function load_keyword_genesets(). The argument is the taxon ID of an organism. The full list of supported organisms can be found in the documentation of load_keyword_genesets() (or in the object UniProtKeywords:::ORGANISM).

gl = load_keyword_genesets(9606)
gl[sample(length(gl), 2)]
## $`Heparin-binding`
##  [1] "351"    "51816"  "9289"   "4043"   "27329"  "462"    "26"     "333"   
##  [9] "338"    "348"    "350"    "339366" "9510"   "9508"   "566"    "6359"  
## [17] "6368"   "6354"   "6355"   "3491"   "1490"   "1289"   "1305"   "1311"  
## [25] "84570"  "6372"   "2160"   "129804" "2200"   "2246"   "2247"   "2249"  
## [33] "2252"   "2254"   "143282" "2260"   "2263"   "2335"   "11167"  "5045"  
## [41] "80739"  "2660"   "5270"   "64388"  "1839"   "3053"   "3273"   "3617"  
## [49] "50939"  "5104"   "3730"   "3918"   "3990"   "9388"   "200879" "149998"
## [57] "4023"   "4053"   "56925"  "4192"   "8829"   "8828"   "26577"  "5197"  
## [65] "5196"   "5228"   "10631"  "5549"   "5553"   "400668" "5764"   "5792"  
## [73] "5802"   "6146"   "6159"   "284654" "340419" "84870"  "343637" "6288"  
## [81] "50964"  "4057"   "7057"   "7058"   "7422"   "7423"   "7448"   "51156" 
## 
## $`rRNA-binding`
##  [1] "284119" "55794"  "26284"  "22868"  "60493"  "55272"  "84678"  "51776" 
##  [9] "10436"  "79159"  "387338" "55037"  "81554"  "6135"   "6147"   "6167"  
## [17] "6125"   "6132"   "80135"  "6205"   "6222"   "6191"   "6192"   "140032"
## [25] "6203"   "51373"  "23107"
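
A list of this shape (keyword names mapping to character vectors of gene IDs) can be inverted to look up the keywords attached to a given gene. The following is a minimal base-R sketch on a toy list; the keyword names and gene IDs in `toy_gl` are made up for illustration and only stand in for the object returned by load_keyword_genesets().

```r
# Toy stand-in for the list returned by load_keyword_genesets();
# keyword names and gene IDs are hypothetical.
toy_gl = list(
    "Keyword A" = c("1", "2", "3"),
    "Keyword B" = c("2", "4"),
    "Keyword C" = c("3")
)

# Flatten to a long table: one row per (keyword, gene) pair.
long = data.frame(
    keyword = rep(names(toy_gl), times = lengths(toy_gl)),
    gene = unlist(toy_gl, use.names = FALSE),
    stringsAsFactors = FALSE
)

# Invert the mapping: for each gene, the keywords it is annotated with.
gene2kw = split(long$keyword, long$gene)

gene2kw[["2"]]
```

With the real data, the same steps should apply with `gl` in place of `toy_gl`.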

You can also specify the organism by its name instead of its taxon ID.

If the argument as_table is set to TRUE, load_keyword_genesets() returns a two-column data frame.

tb = load_keyword_genesets(9606, as_table = TRUE)
head(tb)
##   keyword   gene
## 1  2Fe-2S   2230
## 2  2Fe-2S    316
## 3  2Fe-2S  55847
## 4  2Fe-2S 493856
## 5  2Fe-2S 284106
## 6  2Fe-2S  57019
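
The table form is convenient for counting and joining, and it can be converted back to the list form with base R. A small sketch, assuming a data frame with the same two columns (keyword, gene) as shown above; the rows in `toy_tb` are hypothetical.

```r
# Toy stand-in for load_keyword_genesets(..., as_table = TRUE);
# the rows are hypothetical.
toy_tb = data.frame(
    keyword = c("Keyword A", "Keyword A", "Keyword B"),
    gene = c("1", "2", "2"),
    stringsAsFactors = FALSE
)

# Number of genes per keyword, computed directly from the table.
n_genes = table(toy_tb$keyword)

# Convert the table back to a named list of gene vectors.
toy_list = split(toy_tb$gene, toy_tb$keyword)
```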

Statistics

Below are some simple statistics on the keyword gene sets loaded above.

  1. Sizes of keyword genesets:
plot(table(sapply(gl, length)), log = "x", 
    xlab = "Size of keyword genesets",
    ylab = "Number of keywords"
)

  2. Numbers of words in keywords:
plot(table(sapply(strsplit(names(gl), " |-|/"), length)), 
    xlab = "Number of words in keywords",
    ylab = "Number of keywords"
)

  3. Numbers of characters in keywords:
plot(table(nchar(names(gl))), 
    xlab = "Number of characters in keywords",
    ylab = "Number of keywords"
)
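
The three quantities plotted above can also be collected in one data frame, which is handy for sorting or filtering. A minimal sketch on a toy list: the keyword names are taken from the output shown earlier, but the gene IDs are made up. `lengths()` is used here as a base-R shorthand for `sapply(x, length)`.

```r
# Toy gene-set list; gene IDs are hypothetical.
toy_gl = list(
    "Heparin-binding" = c("1", "2", "3"),
    "rRNA-binding" = c("2", "4"),
    "Alternative splicing" = c("1", "2", "3", "4", "5")
)

size = lengths(toy_gl)                               # genes per keyword
n_words = lengths(strsplit(names(toy_gl), " |-|/"))  # words per keyword
n_chars = nchar(names(toy_gl))                       # characters per keyword

# One row per keyword with all three statistics.
stats = data.frame(
    keyword = names(toy_gl), size, n_words, n_chars,
    row.names = NULL, stringsAsFactors = FALSE
)
stats
```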

Practice

Practice 1

Which keywords have more than 2000 genes, and which keywords are longer than 35 characters?

len = sapply(gl, length)
len[len > 2000]
##              3D-structure               Acetylation      Alternative splicing 
##                      7617                      3305                     10018 
##               Coiled coil Direct protein sequencing           Disease variant 
##                      2017                      2762                      3603 
##            Disulfide bond                  Membrane            Phosphoprotein 
##                      3273                      5177                      7922 
##        Reference proteome                    Repeat                    Signal 
##                     18125                      4622                      3125 
##             Transcription             Transmembrane       Transmembrane helix 
##                      2286                      4797                      4793 
##           Ubl conjugation 
##                      2478
nc = nchar(names(gl))
names(gl)[nc > 35]
##  [1] "Activation of host autophagy by virus"                   
##  [2] "Branched-chain amino acid catabolism"                    
##  [3] "Complement activation lectin pathway"                    
##  [4] "Congenital disorder of glycosylation"                    
##  [5] "Congenital generalized lipodystrophy"                    
##  [6] "Congenital stationary night blindness"                   
##  [7] "Familial hemophagocytic lymphohistiocytosis"             
##  [8] "Hereditary nonpolyposis colorectal cancer"               
##  [9] "Inhibition of host innate immune response by virus"      
## [10] "Inhibition of host interferon signaling pathway by virus"
## [11] "Lacrimo-auriculo-dento-digital syndrome"                 
## [12] "Microtubular inwards viral transport"                    
## [13] "Progressive external ophthalmoplegia"                    
## [14] "Rhizomelic chondrodysplasia punctata"
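
The two conditions can also be combined into a single logical filter. A small sketch on a toy list with a scaled-down size threshold; `min_genes`, `min_chars` and the gene IDs are illustrative assumptions, not values from the package.

```r
# Toy gene-set list; gene IDs are hypothetical.
toy_gl = list(
    "Membrane" = as.character(1:6),
    "Congenital stationary night blindness" = c("1", "2"),
    "Signal" = c("3")
)

min_genes = 5   # stands in for the 2000-gene cutoff used above
min_chars = 35  # same character cutoff as above

len = lengths(toy_gl)
nc = nchar(names(toy_gl))

# Keep keywords that satisfy either condition.
keep = len > min_genes | nc > min_chars
names(toy_gl)[keep]
```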