The .gmt data format for gene sets

MSigDB defines a simple .gmt format for storing gene sets. It has the following format:

gene_set_1  gene_set_description    gene1   gene2   gene3
gene_set_2  gene_set_description    gene4   gene5
...

The .gmt format is used more and more for sharing gene set data, e.g. https://maayanlab.cloud/Enrichr/#libraries.

Let’s try to read a .gmt file. We use the .gmt file for the Hallmark gene set.

download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/h.all.v2023.2.Hs.symbols.gmt",
  destfile = "h.all.v2023.2.Hs.symbols.gmt")

Since it is not a table, we have to read the file by lines.

ln = readLines("h.all.v2023.2.Hs.symbols.gmt")
ln = strsplit(ln, "\t")
gs = lapply(ln, function(x) x[-(1:2)])
names(gs) = sapply(ln, function(x) x[1])
gs[1:2]
## $HALLMARK_ADIPOGENESIS
##   [1] "ABCA1"    "ABCB8"    "ACAA2"    "ACADL"    "ACADM"    "ACADS"   
##   [7] "ACLY"     "ACO2"     "ACOX1"    "ADCY6"    "ADIG"     "ADIPOQ"  
##  [13] "ADIPOR2"  "AGPAT3"   "AIFM1"    "AK2"      "ALDH2"    "ALDOA"   
##  [19] "ANGPT1"   "ANGPTL4"  "APLP2"    "APOE"     "ARAF"     "ARL4A"   
##  [25] "ATL2"     "ATP1B3"   "ATP5PO"   "BAZ2A"    "BCKDHA"   "BCL2L13" 
##  [31] "BCL6"     "C3"       "CAT"      "CAVIN1"   "CAVIN2"   "CCNG2"   
##  [37] "CD151"    "CD302"    "CD36"     "CDKN2C"   "CHCHD10"  "CHUK"    
##  [43] "CIDEA"    "CMBL"     "CMPK1"    "COL15A1"  "COL4A1"   "COQ3"    
##  [49] "COQ5"     "COQ9"     "COX6A1"   "COX7B"    "COX8A"    "CPT2"    
##  [55] "CRAT"     "CS"       "CYC1"     "CYP4B1"   "DBT"      "DDT"     
##  [61] "DECR1"    "DGAT1"    "DHCR7"    "DHRS7"    "DHRS7B"   "DLAT"    
##  [67] "DLD"      "DNAJB9"   "DNAJC15"  "DRAM2"    "ECH1"     "ECHS1"   
##  [73] "ELMOD3"   "ELOVL6"   "ENPP2"    "EPHX2"    "ESRRA"    "ESYT1"   
##  [79] "ETFB"     "FABP4"    "FAH"      "FZD4"     "G3BP2"    "GADD45A" 
##  [85] "GBE1"     "GHITM"    "GPAM"     "GPAT4"    "GPD2"     "GPHN"    
##  [91] "GPX3"     "GPX4"     "GRPEL1"   "HADH"     "HIBCH"    "HSPB8"   
##  [97] "IDH1"     "IDH3A"    "IDH3G"    "IFNGR1"   "IMMT"     "ITGA7"   
## [103] "ITIH5"    "ITSN1"    "JAGN1"    "LAMA4"    "LEP"      "LIFR"    
## [109] "LIPE"     "LPCAT3"   "LPL"      "LTC4S"    "MAP4K3"   "MCCC1"   
## [115] "MDH2"     "ME1"      "MGLL"     "MGST3"    "MIGA2"    "MRAP"    
## [121] "MRPL15"   "MTARC2"   "MTCH2"    "MYLK"     "NABP1"    "NDUFA5"  
## [127] "NDUFAB1"  "NDUFB7"   "NDUFS3"   "NKIRAS1"  "NMT1"     "OMD"     
## [133] "ORM1"     "PDCD4"    "PEMT"     "PEX14"    "PFKFB3"   "PFKL"    
## [139] "PGM1"     "PHLDB1"   "PHYH"     "PIM3"     "PLIN2"    "POR"     
## [145] "PPARG"    "PPM1B"    "PPP1R15B" "PRDX3"    "PREB"     "PTCD3"   
## [151] "PTGER3"   "QDPR"     "RAB34"    "REEP5"    "REEP6"    "RETN"    
## [157] "RETSAT"   "RIOK3"    "RMDN3"    "RNF11"    "RREB1"    "RTN3"    
## [163] "SAMM50"   "SCARB1"   "SCP2"     "SDHB"     "SDHC"     "SLC19A1" 
## [169] "SLC1A5"   "SLC25A1"  "SLC25A10" "SLC27A1"  "SLC5A6"   "SLC66A3" 
## [175] "SNCG"     "SOD1"     "SORBS1"   "SOWAHC"   "SPARCL1"  "SQOR"    
## [181] "SSPN"     "STAT5A"   "STOM"     "SUCLG1"   "SULT1A1"  "TALDO1"  
## [187] "TANK"     "TKT"      "TOB1"     "TST"      "UBC"      "UBQLN1"  
## [193] "UCK1"     "UCP2"     "UQCR10"   "UQCR11"   "UQCRC1"   "UQCRQ"   
## [199] "VEGFB"    "YWHAG"   
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1] "AARS1"    "ABCE1"    "ABI1"     "ACHE"     "ACVR2A"   "AKT1"    
##   [7] "APBB1"    "B2M"      "BCAT1"    "BCL10"    "BCL3"     "BRCA1"   
##  [13] "C2"       "CAPG"     "CARTPT"   "CCL11"    "CCL13"    "CCL19"   
##  [19] "CCL2"     "CCL22"    "CCL4"     "CCL5"     "CCL7"     "CCND2"   
##  [25] "CCND3"    "CCR1"     "CCR2"     "CCR5"     "CD1D"     "CD2"     
##  [31] "CD247"    "CD28"     "CD3D"     "CD3E"     "CD3G"     "CD4"     
##  [37] "CD40"     "CD40LG"   "CD47"     "CD7"      "CD74"     "CD79A"   
##  [43] "CD80"     "CD86"     "CD8A"     "CD8B"     "CD96"     "CDKN2A"  
##  [49] "CFP"      "CRTAM"    "CSF1"     "CSK"      "CTSS"     "CXCL13"  
##  [55] "CXCL9"    "CXCR3"    "DARS1"    "DEGS1"    "DYRK3"    "EGFR"    
##  [61] "EIF3A"    "EIF3D"    "EIF3J"    "EIF4G3"   "EIF5A"    "ELANE"   
##  [67] "ELF4"     "EREG"     "ETS1"     "F2"       "F2R"      "FAS"     
##  [73] "FASLG"    "FCGR2B"   "FGR"      "FLNA"     "FYB1"     "GALNT1"  
##  [79] "GBP2"     "GCNT1"    "GLMN"     "GPR65"    "GZMA"     "GZMB"    
##  [85] "HCLS1"    "HDAC9"    "HIF1A"    "HLA-A"    "HLA-DMA"  "HLA-DMB" 
##  [91] "HLA-DOA"  "HLA-DOB"  "HLA-DQA1" "HLA-DRA"  "HLA-E"    "HLA-G"   
##  [97] "ICAM1"    "ICOSLG"   "IFNAR2"   "IFNG"     "IFNGR1"   "IFNGR2"  
## [103] "IGSF6"    "IKBKB"    "IL10"     "IL11"     "IL12A"    "IL12B"   
## [109] "IL12RB1"  "IL13"     "IL15"     "IL16"     "IL18"     "IL18RAP" 
## [115] "IL1B"     "IL2"      "IL27RA"   "IL2RA"    "IL2RB"    "IL2RG"   
## [121] "IL4"      "IL4R"     "IL6"      "IL7"      "IL9"      "INHBA"   
## [127] "INHBB"    "IRF4"     "IRF7"     "IRF8"     "ITGAL"    "ITGB2"   
## [133] "ITK"      "JAK2"     "KLRD1"    "KRT1"     "LCK"      "LCP2"    
## [139] "LIF"      "LTB"      "LY75"     "LY86"     "LYN"      "MAP3K7"  
## [145] "MAP4K1"   "MBL2"     "MMP9"     "MRPL3"    "MTIF2"    "NCF4"    
## [151] "NCK1"     "NCR1"     "NLRP3"    "NME1"     "NOS2"     "NPM1"    
## [157] "PF4"      "PRF1"     "PRKCB"    "PRKCG"    "PSMB10"   "PTPN6"   
## [163] "PTPRC"    "RARS1"    "RIPK2"    "RPL39"    "RPL3L"    "RPL9"    
## [169] "RPS19"    "RPS3A"    "RPS9"     "SIT1"     "SOCS1"    "SOCS5"   
## [175] "SPI1"     "SRGN"     "ST8SIA4"  "STAB1"    "STAT1"    "STAT4"   
## [181] "TAP1"     "TAP2"     "TAPBP"    "TGFB1"    "TGFB2"    "THY1"    
## [187] "TIMP1"    "TLR1"     "TLR2"     "TLR3"     "TLR6"     "TNF"     
## [193] "TPD52"    "TRAF2"    "TRAT1"    "UBE2D1"   "UBE2N"    "WARS1"   
## [199] "WAS"      "ZAP70"

In rGREAT package, there is a helper function read_gmt() which can read gene sets from .gmt file and also does gene ID conversion.

## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.3.2
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## 
## 
## ========================================
## rGREAT version 2.5.1
## Bioconductor page: http://bioconductor.org/packages/rGREAT/
## Github page: https://github.com/jokergoo/rGREAT
## 
## If you use it in published research, please cite:
## Gu, Z. rGREAT: an R/Bioconductor package for functional enrichment 
##   on genomic regions. Bioinformatics 2023.
## 
## This message can be suppressed by:
##   suppressPackageStartupMessages(library(rGREAT))
## 
## Note: On Aug 19 2019 GREAT released version 4 where it supports `hg38`
## genome and removes some ontologies such pathways. `submitGreatJob()`
## still takes `hg19` as default. `hg38` can be specified by the `species
## = 'hg38'` argument. To use the older versions such as 3.0.0, specify as
## `submitGreatJob(..., version = '3.0.0')`.
## 
## From rGREAT version 1.99.0, it also implements the GREAT algorithm and
## it allows to integrate GREAT analysis with the Bioconductor annotation
## ecosystem. By default it supports more than 500 organisms. Check the
## new function `great()` and the new vignette.
## ========================================
gs = read_gmt("h.all.v2023.2.Hs.symbols.gmt")

gs2 = read_gmt("h.all.v2023.2.Hs.symbols.gmt", 
    from = "SYMBOL", to = "ENSEMBL", orgdb = "org.Hs.eg.db")
gs2[1:2]
## $HALLMARK_ADIPOGENESIS
##   [1] "ENSG00000165029" "ENSG00000197150" "ENSG00000167315" "ENSG00000115361"
##   [5] "ENSG00000117054" "ENSG00000122971" "ENSG00000131473" "ENSG00000100412"
##   [9] "ENSG00000161533" "ENSG00000174233" "ENSG00000182035" "ENSG00000181092"
##  [13] "ENSG00000160216" "ENSG00000156709" "ENSG00000004455" "ENSG00000111275"
##  [17] "ENSG00000149925" "ENSG00000154188" "ENSG00000167772" "ENSG00000084234"
##  [21] "ENSG00000130203" "ENSG00000078061" "ENSG00000122644" "ENSG00000119787"
##  [25] "ENSG00000069849" "ENSG00000241837" "ENSG00000076108" "ENSG00000248098"
##  [29] "ENSG00000099968" "ENSG00000113916" "ENSG00000125730" "ENSG00000121691"
##  [33] "ENSG00000177469" "ENSG00000168497" "ENSG00000138764" "ENSG00000177697"
##  [37] "ENSG00000241399" "ENSG00000135218" "ENSG00000123080" "ENSG00000213341"
##  [41] "ENSG00000176194" "ENSG00000164237" "ENSG00000162368" "ENSG00000204291"
##  [45] "ENSG00000187498" "ENSG00000132423" "ENSG00000110871" "ENSG00000088682"
##  [49] "ENSG00000111775" "ENSG00000131174" "ENSG00000176340" "ENSG00000157184"
##  [53] "ENSG00000095321" "ENSG00000062485" "ENSG00000179091" "ENSG00000142973"
##  [57] "ENSG00000137992" "ENSG00000104325" "ENSG00000172893" "ENSG00000100612"
##  [61] "ENSG00000109016" "ENSG00000150768" "ENSG00000091140" "ENSG00000128590"
##  [65] "ENSG00000120675" "ENSG00000156171" "ENSG00000127884" "ENSG00000115459"
##  [69] "ENSG00000170522" "ENSG00000136960" "ENSG00000120915" "ENSG00000173153"
##  [73] "ENSG00000139641" "ENSG00000105379" "ENSG00000170323" "ENSG00000103876"
##  [77] "ENSG00000174804" "ENSG00000138757" "ENSG00000116717" "ENSG00000114480"
##  [81] "ENSG00000165678" "ENSG00000119927" "ENSG00000158669" "ENSG00000115159"
##  [85] "ENSG00000171723" "ENSG00000211445" "ENSG00000167468" "ENSG00000109519"
##  [89] "ENSG00000138796" "ENSG00000198130" "ENSG00000152137" "ENSG00000138413"
##  [93] "ENSG00000166411" "ENSG00000067829" "ENSG00000027697" "ENSG00000132305"
##  [97] "ENSG00000135424" "ENSG00000123243" "ENSG00000205726" "ENSG00000171135"
## [101] "ENSG00000112769" "ENSG00000174697" "ENSG00000113594" "ENSG00000079435"
## [105] "ENSG00000111684" "ENSG00000175445" "ENSG00000011566" "ENSG00000078070"
## [109] "ENSG00000146701" "ENSG00000065833" "ENSG00000074416" "ENSG00000143198"
## [113] "ENSG00000148343" "ENSG00000170262" "ENSG00000137547" "ENSG00000117791"
## [117] "ENSG00000065534" "ENSG00000173559" "ENSG00000128609" "ENSG00000004779"
## [121] "ENSG00000099795" "ENSG00000197885" "ENSG00000136448" "ENSG00000127083"
## [125] "ENSG00000229314" "ENSG00000150593" "ENSG00000133027" "ENSG00000142655"
## [129] "ENSG00000170525" "ENSG00000141959" "ENSG00000079739" "ENSG00000019144"
## [133] "ENSG00000107537" "ENSG00000198355" "ENSG00000147872" "ENSG00000127948"
## [137] "ENSG00000132170" "ENSG00000138032" "ENSG00000158615" "ENSG00000165672"
## [141] "ENSG00000138073" "ENSG00000132300" "ENSG00000050628" "ENSG00000151552"
## [145] "ENSG00000109113" "ENSG00000129625" "ENSG00000115255" "ENSG00000104918"
## [149] "ENSG00000042445" "ENSG00000101782" "ENSG00000137824" "ENSG00000123091"
## [153] "ENSG00000124782" "ENSG00000133318" "ENSG00000100347" "ENSG00000073060"
## [157] "ENSG00000116171" "ENSG00000117118" "ENSG00000143252" "ENSG00000173638"
## [161] "ENSG00000105281" "ENSG00000100075" "ENSG00000183048" "ENSG00000130304"
## [165] "ENSG00000138074" "ENSG00000162976" "ENSG00000173267" "ENSG00000142168"
## [169] "ENSG00000095637" "ENSG00000198142" "ENSG00000152583" "ENSG00000137767"
## [173] "ENSG00000123096" "ENSG00000126561" "ENSG00000148175" "ENSG00000163541"
## [177] "ENSG00000177156" "ENSG00000136560" "ENSG00000163931" "ENSG00000141232"
## [181] "ENSG00000128311" "ENSG00000150991" "ENSG00000135018" "ENSG00000130717"
## [185] "ENSG00000175567" "ENSG00000184076" "ENSG00000127540" "ENSG00000010256"
## [189] "ENSG00000164405" "ENSG00000173511" "ENSG00000170027"
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1] "ENSG00000090861" "ENSG00000164163" "ENSG00000136754" "ENSG00000087085"
##   [5] "ENSG00000121989" "ENSG00000142208" "ENSG00000166313" "ENSG00000060982"
##   [9] "ENSG00000142867" "ENSG00000069399" "ENSG00000012048" "ENSG00000042493"
##  [13] "ENSG00000164326" "ENSG00000172156" "ENSG00000181374" "ENSG00000172724"
##  [17] "ENSG00000108691" "ENSG00000102962" "ENSG00000108688" "ENSG00000118971"
##  [21] "ENSG00000112576" "ENSG00000163823" "ENSG00000121807" "ENSG00000160791"
##  [25] "ENSG00000158473" "ENSG00000116824" "ENSG00000198821" "ENSG00000178562"
##  [29] "ENSG00000167286" "ENSG00000198851" "ENSG00000160654" "ENSG00000010610"
##  [33] "ENSG00000101017" "ENSG00000102245" "ENSG00000196776" "ENSG00000173762"
##  [37] "ENSG00000019582" "ENSG00000105369" "ENSG00000121594" "ENSG00000114013"
##  [41] "ENSG00000153563" "ENSG00000172116" "ENSG00000153283" "ENSG00000147889"
##  [45] "ENSG00000126759" "ENSG00000109943" "ENSG00000184371" "ENSG00000103653"
##  [49] "ENSG00000163131" "ENSG00000156234" "ENSG00000138755" "ENSG00000186810"
##  [53] "ENSG00000115866" "ENSG00000143753" "ENSG00000143479" "ENSG00000146648"
##  [57] "ENSG00000170099" "ENSG00000100353" "ENSG00000104131" "ENSG00000075151"
##  [61] "ENSG00000204472" "ENSG00000102034" "ENSG00000124882" "ENSG00000134954"
##  [65] "ENSG00000180210" "ENSG00000181104" "ENSG00000026103" "ENSG00000117560"
##  [69] "ENSG00000072694" "ENSG00000000938" "ENSG00000196924" "ENSG00000082074"
##  [73] "ENSG00000141429" "ENSG00000162645" "ENSG00000187210" "ENSG00000174842"
##  [77] "ENSG00000140030" "ENSG00000145649" "ENSG00000100453" "ENSG00000180353"
##  [81] "ENSG00000048052" "ENSG00000100644" "ENSG00000090339" "ENSG00000160223"
##  [85] "ENSG00000159110" "ENSG00000111537" "ENSG00000027697" "ENSG00000104365"
##  [89] "ENSG00000136634" "ENSG00000095752" "ENSG00000168811" "ENSG00000113302"
##  [93] "ENSG00000096996" "ENSG00000169194" "ENSG00000164136" "ENSG00000172349"
##  [97] "ENSG00000150782" "ENSG00000115607" "ENSG00000125538" "ENSG00000109471"
## [101] "ENSG00000134460" "ENSG00000100385" "ENSG00000147168" "ENSG00000113520"
## [105] "ENSG00000077238" "ENSG00000136244" "ENSG00000104432" "ENSG00000145839"
## [109] "ENSG00000122641" "ENSG00000163083" "ENSG00000137265" "ENSG00000140968"
## [113] "ENSG00000005844" "ENSG00000160255" "ENSG00000113263" "ENSG00000096968"
## [117] "ENSG00000134539" "ENSG00000167768" "ENSG00000182866" "ENSG00000043462"
## [121] "ENSG00000128342" "ENSG00000054219" "ENSG00000112799" "ENSG00000254087"
## [125] "ENSG00000135341" "ENSG00000165471" "ENSG00000100985" "ENSG00000114686"
## [129] "ENSG00000085760" "ENSG00000158092" "ENSG00000162711" "ENSG00000239672"
## [133] "ENSG00000007171" "ENSG00000181163" "ENSG00000163737" "ENSG00000180644"
## [137] "ENSG00000166501" "ENSG00000126583" "ENSG00000205220" "ENSG00000111679"
## [141] "ENSG00000113643" "ENSG00000104312" "ENSG00000198918" "ENSG00000140986"
## [145] "ENSG00000163682" "ENSG00000105372" "ENSG00000145425" "ENSG00000137078"
## [149] "ENSG00000185338" "ENSG00000171150" "ENSG00000066336" "ENSG00000122862"
## [153] "ENSG00000113532" "ENSG00000010327" "ENSG00000115415" "ENSG00000138378"
## [157] "ENSG00000105329" "ENSG00000092969" "ENSG00000154096" "ENSG00000102265"
## [161] "ENSG00000174125" "ENSG00000137462" "ENSG00000164342" "ENSG00000174130"
## [165] "ENSG00000076554" "ENSG00000127191" "ENSG00000163519" "ENSG00000072401"
## [169] "ENSG00000177889" "ENSG00000140105" "ENSG00000015285" "ENSG00000115085"

The msigdbr package

There is already an R package msigdbr which has integerated all MSigDB gene sets.

The major advantage is, MSigDB only supports human and mouse, msigdbr supports more organisms by mapping to the orthologe genes.

## # A tibble: 20 × 2
##    species_name                    species_common_name                          
##    <chr>                           <chr>                                        
##  1 Anolis carolinensis             Carolina anole, green anole                  
##  2 Bos taurus                      bovine, cattle, cow, dairy cow, domestic cat…
##  3 Caenorhabditis elegans          NA                                           
##  4 Canis lupus familiaris          dog, dogs                                    
##  5 Danio rerio                     leopard danio, zebra danio, zebra fish, zebr…
##  6 Drosophila melanogaster         fruit fly                                    
##  7 Equus caballus                  domestic horse, equine, horse                
##  8 Felis catus                     cat, cats, domestic cat                      
##  9 Gallus gallus                   bantam, chicken, chickens, Gallus domesticus 
## 10 Homo sapiens                    human                                        
## 11 Macaca mulatta                  rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica           gray short-tailed opossum                    
## 13 Mus musculus                    house mouse, mouse                           
## 14 Ornithorhynchus anatinus        duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes                 chimpanzee                                   
## 16 Rattus norvegicus               brown rat, Norway rat, rat, rats             
## 17 Saccharomyces cerevisiae        baker's yeast, brewer's yeast, S. cerevisiae 
## 18 Schizosaccharomyces pombe 972h- NA                                           
## 19 Sus scrofa                      pig, pigs, swine, wild boar                  
## 20 Xenopus tropicalis              tropical clawed frog, western clawed frog

All categories of gene sets:

##    gs_cat       gs_subcat num_genesets
## 1      C1                          299
## 2      C2             CGP         3384
## 3      C2              CP           29
## 4      C2     CP:BIOCARTA          292
## 5      C2         CP:KEGG          186
## 6      C2          CP:PID          196
## 7      C2     CP:REACTOME         1615
## 8      C2 CP:WIKIPATHWAYS          664
## 9      C3       MIR:MIRDB         2377
## 10     C3  MIR:MIR_Legacy          221
## 11     C3        TFT:GTRD          518
## 12     C3  TFT:TFT_Legacy          610
## 13     C4             CGN          427
## 14     C4              CM          431
## 15     C5           GO:BP         7658
## 16     C5           GO:CC         1006
## 17     C5           GO:MF         1738
## 18     C5             HPO         5071
## 19     C6                          189
## 20     C7     IMMUNESIGDB         4872
## 21     C7             VAX          347
## 22     C8                          700
## 23      H                           50

Values in the gs_cat column and the gs_subcat column can be used to retrieve gene sets in a specific category. E.g., we want to extract genesets in C2 category and CP:KEGG sub-category:

gene_sets = msigdbr(category = "C2", subcategory = "CP:KEGG")
gene_sets
## # A tibble: 16,283 × 15
##    gs_cat gs_subcat gs_name               gene_symbol entrez_gene ensembl_gene  
##    <chr>  <chr>     <chr>                 <chr>             <int> <chr>         
##  1 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA1                19 ENSG000001650…
##  2 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA10            10349 ENSG000001542…
##  3 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA12            26154 ENSG000001444…
##  4 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA13           154664 ENSG000001798…
##  5 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA2                20 ENSG000001073…
##  6 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA3                21 ENSG000001679…
##  7 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA4                24 ENSG000001986…
##  8 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA5             23461 ENSG000001542…
##  9 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA6             23460 ENSG000001542…
## 10 C2     CP:KEGG   KEGG_ABC_TRANSPORTERS ABCA7             10347 ENSG000000646…
## # ℹ 16,273 more rows
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>

If you don’t like the tibble format, you can change it to a data frame.

gene_sets = as.data.frame(gene_sets)
head(gene_sets)
##   gs_cat gs_subcat               gs_name gene_symbol entrez_gene
## 1     C2   CP:KEGG KEGG_ABC_TRANSPORTERS       ABCA1          19
## 2     C2   CP:KEGG KEGG_ABC_TRANSPORTERS      ABCA10       10349
## 3     C2   CP:KEGG KEGG_ABC_TRANSPORTERS      ABCA12       26154
## 4     C2   CP:KEGG KEGG_ABC_TRANSPORTERS      ABCA13      154664
## 5     C2   CP:KEGG KEGG_ABC_TRANSPORTERS       ABCA2          20
## 6     C2   CP:KEGG KEGG_ABC_TRANSPORTERS       ABCA3          21
##      ensembl_gene human_gene_symbol human_entrez_gene human_ensembl_gene  gs_id
## 1 ENSG00000165029             ABCA1                19    ENSG00000165029 M11911
## 2 ENSG00000154263            ABCA10             10349    ENSG00000154263 M11911
## 3 ENSG00000144452            ABCA12             26154    ENSG00000144452 M11911
## 4 ENSG00000179869            ABCA13            154664    ENSG00000179869 M11911
## 5 ENSG00000107331             ABCA2                20    ENSG00000107331 M11911
## 6 ENSG00000167972             ABCA3                21    ENSG00000167972 M11911
##   gs_pmid gs_geoid gs_exact_source
## 1                         hsa02010
## 2                         hsa02010
## 3                         hsa02010
## 4                         hsa02010
## 5                         hsa02010
## 6                         hsa02010
##                                                gs_url   gs_description
## 1 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 2 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 3 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 4 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 5 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 6 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters

Danger zone: The entrez_gene column is saved in integer. This may cause unseeable problems. Convert it explicitely to character: gene_set$entrez_gene = as.character(gene_set$entrez_gene).

Another advantage of misgdbr is that it provides all three major gene ID types. For example, if you need Ensembl gene IDs, you can obtain the gene sets by:

gene_sets[, c("gs_name", "ensembl_gene")] |> head()
##                 gs_name    ensembl_gene
## 1 KEGG_ABC_TRANSPORTERS ENSG00000165029
## 2 KEGG_ABC_TRANSPORTERS ENSG00000154263
## 3 KEGG_ABC_TRANSPORTERS ENSG00000144452
## 4 KEGG_ABC_TRANSPORTERS ENSG00000179869
## 5 KEGG_ABC_TRANSPORTERS ENSG00000107331
## 6 KEGG_ABC_TRANSPORTERS ENSG00000167972

By default it retrieves data for human. The species argument can be used for other supported species.

gene_sets = msigdbr(species = "dog", category = "C2", subcategory = "CP:BIOCARTA")
gene_sets = as.data.frame(gene_sets)
head(gene_sets)
##   gs_cat   gs_subcat               gs_name gene_symbol entrez_gene
## 1     C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY        ATF2      478806
## 2     C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY        CHUK      477796
## 3     C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY        IFNG      403801
## 4     C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY       IKBKB      482839
## 5     C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY         IL2      403989
## 6     C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY         IL4      403785
##         ensembl_gene human_gene_symbol human_entrez_gene human_ensembl_gene
## 1 ENSCAFG00845017708              ATF2              1386    ENSG00000115966
## 2 ENSCAFG00845016257              CHUK              1147    ENSG00000213341
## 3 ENSCAFG00845013680              IFNG              3458    ENSG00000111537
## 4 ENSCAFG00845006197             IKBKB              3551    ENSG00000104365
## 5 ENSCAFG00845020168               IL2              3558    ENSG00000109471
## 6 ENSCAFG00845010991               IL4              3565    ENSG00000113520
##   gs_id gs_pmid gs_geoid gs_exact_source
## 1 M2064                                 
## 2 M2064                                 
## 3 M2064                                 
## 4 M2064                                 
## 5 M2064                                 
## 6 M2064                                 
##                                                                                gs_url
## 1 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 2 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 3 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 4 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 5 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 6 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
##                        gs_description taxon_id
## 1 The 4-1BB-dependent immune response     9615
## 2 The 4-1BB-dependent immune response     9615
## 3 The 4-1BB-dependent immune response     9615
## 4 The 4-1BB-dependent immune response     9615
## 5 The 4-1BB-dependent immune response     9615
## 6 The 4-1BB-dependent immune response     9615
##                             ortholog_sources num_ortholog_sources
## 1 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB                    5
## 2 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB                    5
## 3 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB                    5
## 4 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB                    5
## 5         Ensembl|HomoloGene|Inparanoid|NCBI                    4
## 6 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB                    5

A safer check

The table contains mappings between gene sets and all the three gene ID types. Since Symbol-EntreZ-Ensembl mapping is not 1:1, it is possible rows are duplicated if only taking two columns from the complete table:

gene_sets = msigdbr(category = "C5", subcategory = "GO:BP")

tb = gene_sets[, c("gs_name", "gene_symbol")]
dim(tb)
## [1] 721379      2
dim(unique(tb))
## [1] 633008      2

So, additionally applying unique() is always safe.

Functions in the GSEAtraining package

We also have several functions in GSEAtraining to retrieve gene sets from MSigDB. It directly downloads data from MSigDB web server.

##  [1] "6.0"       "6.1"       "6.2"       "7.0"       "7.1"       "7.2"      
##  [7] "7.3"       "7.4"       "7.5.1"     "7.5"       "2022.1.Hs" "2022.1.Mm"
## [13] "2023.1.Hs" "2023.1.Mm" "2023.2.Hs" "2023.2.Mm"
##  [1] "c1.all"             "c2.all"             "c2.cgp"            
##  [4] "c2.cp.biocarta"     "c2.cp.kegg_legacy"  "c2.cp.kegg_medicus"
##  [7] "c2.cp.pid"          "c2.cp.reactome"     "c2.cp"             
## [10] "c2.cp.wikipathways" "c3.all"             "c3.mir.mir_legacy" 
## [13] "c3.mir.mirdb"       "c3.mir"             "c3.tft.gtrd"       
## [16] "c3.tft.tft_legacy"  "c3.tft"             "c4.3ca"            
## [19] "c4.all"             "c4.cgn"             "c4.cm"             
## [22] "c5.all"             "c5.go.bp"           "c5.go.cc"          
## [25] "c5.go.mf"           "c5.go"              "c5.hpo"            
## [28] "c6.all"             "c7.all"             "c7.immunesigdb"    
## [31] "c7.vax"             "c8.all"             "h.all"
lt = get_msigdb(version = "2023.2.Hs", collection = "h.all")
lt[1:2]
## $HALLMARK_ADIPOGENESIS
##   [1] "19"     "11194"  "10449"  "33"     "34"     "35"     "47"     "50"    
##   [9] "51"     "112"    "149685" "9370"   "79602"  "56894"  "9131"   "204"   
##  [17] "217"    "226"    "284"    "51129"  "334"    "348"    "369"    "10124" 
##  [25] "64225"  "483"    "539"    "11176"  "593"    "23786"  "604"    "718"   
##  [33] "847"    "284119" "8436"   "901"    "977"    "9936"   "948"    "1031"  
##  [41] "400916" "1147"   "1149"   "134147" "51727"  "1306"   "1282"   "51805" 
##  [49] "84274"  "57017"  "1337"   "1349"   "1351"   "1376"   "1384"   "1431"  
##  [57] "1537"   "1580"   "1629"   "1652"   "1666"   "8694"   "1717"   "51635" 
##  [65] "25979"  "1737"   "1738"   "4189"   "29103"  "128338" "1891"   "1892"  
##  [73] "84173"  "79071"  "5168"   "2053"   "2101"   "23344"  "2109"   "2167"  
##  [81] "2184"   "8322"   "9908"   "1647"   "2632"   "27069"  "57678"  "137964"
##  [89] "2820"   "10243"  "2878"   "2879"   "80273"  "3033"   "26275"  "26353" 
##  [97] "3417"   "3419"   "3421"   "3459"   "10989"  "3679"   "80760"  "6453"  
## [105] "84522"  "3910"   "3952"   "3977"   "3991"   "10162"  "4023"   "4056"  
## [113] "8491"   "56922"  "4191"   "4199"   "11343"  "4259"   "84895"  "56246" 
## [121] "29088"  "54996"  "23788"  "4638"   "64859"  "4698"   "4706"   "4713"  
## [129] "4722"   "28512"  "4836"   "4958"   "5004"   "27250"  "10400"  "5195"  
## [137] "5209"   "5211"   "5236"   "23187"  "5264"   "415116" "123"    "5447"  
## [145] "5468"   "5495"   "84919"  "10935"  "10113"  "55037"  "5733"   "5860"  
## [153] "83871"  "7905"   "92840"  "56729"  "54884"  "8780"   "55177"  "26994" 
## [161] "6239"   "10313"  "25813"  "949"    "6342"   "6390"   "6391"   "6573"  
## [169] "6510"   "6576"   "1468"   "376497" "8884"   "130814" "6623"   "6647"  
## [177] "10580"  "65124"  "8404"   "58472"  "8082"   "6776"   "2040"   "8802"  
## [185] "6817"   "6888"   "10010"  "7086"   "10140"  "7263"   "7316"   "29979" 
## [193] "83549"  "7351"   "29796"  "10975"  "7384"   "27089"  "7423"   "7532"  
## 
## $HALLMARK_ALLOGRAFT_REJECTION
##   [1] "16"     "6059"   "10006"  "43"     "92"     "207"    "322"    "567"   
##   [9] "586"    "8915"   "602"    "672"    "717"    "822"    "9607"   "6356"  
##  [17] "6357"   "6363"   "6347"   "6367"   "6351"   "6352"   "6354"   "894"   
##  [25] "896"    "1230"   "729230" "1234"   "912"    "914"    "919"    "940"   
##  [33] "915"    "916"    "917"    "920"    "958"    "959"    "961"    "924"   
##  [41] "972"    "973"    "941"    "942"    "925"    "926"    "10225"  "1029"  
##  [49] "5199"   "56253"  "1435"   "1445"   "1520"   "10563"  "4283"   "2833"  
##  [57] "1615"   "8560"   "8444"   "1956"   "8661"   "8664"   "8669"   "8672"  
##  [65] "1984"   "1991"   "2000"   "2069"   "2113"   "2147"   "2149"   "355"   
##  [73] "356"    "2213"   "2268"   "2316"   "2533"   "2589"   "2634"   "2650"  
##  [81] "11146"  "8477"   "3001"   "3002"   "3059"   "9734"   "3091"   "3105"  
##  [89] "3108"   "3109"   "3111"   "3112"   "3117"   "3122"   "3133"   "3135"  
##  [97] "3383"   "23308"  "3455"   "3458"   "3459"   "3460"   "10261"  "3551"  
## [105] "3586"   "3589"   "3592"   "3593"   "3594"   "3596"   "3600"   "3603"  
## [113] "3606"   "8807"   "3553"   "3558"   "9466"   "3559"   "3560"   "3561"  
## [121] "3565"   "3566"   "3569"   "3574"   "3578"   "3624"   "3625"   "3662"  
## [129] "3665"   "3394"   "3683"   "3689"   "3702"   "3717"   "3824"   "3848"  
## [137] "3932"   "3937"   "3976"   "4050"   "4065"   "9450"   "4067"   "6885"  
## [145] "11184"  "4153"   "4318"   "11222"  "4528"   "4689"   "4690"   "9437"  
## [153] "114548" "4830"   "4843"   "4869"   "5196"   "5551"   "5579"   "5582"  
## [161] "5699"   "5777"   "5788"   "5917"   "8767"   "6170"   "6123"   "6133"  
## [169] "6223"   "6189"   "6203"   "27240"  "8651"   "9655"   "6688"   "5552"  
## [177] "7903"   "23166"  "6772"   "6775"   "6890"   "6891"   "6892"   "7040"  
## [185] "7042"   "7070"   "7076"   "7096"   "7097"   "7098"   "10333"  "7124"  
## [193] "7163"   "7186"   "50852"  "7321"   "7334"   "7453"   "7454"   "7535"

Practice

Practice 1

The gene set resource on https://maayanlab.cloud/Enrichr/#libraries is very useful. You may want to use it some say in the future. Take one gene set collection (e.g. the COVID-19 related gene sets), download the corresponding gene set file (in .gmt format) and try to read into R as a list or a two-column data frame.

Solution

download.file("https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=COVID-19_Related_Gene_Sets", destfile = "covid-19.gmt")
ln = readLines("covid-19.gmt")
ln = strsplit(ln, "\t")
gs = lapply(ln, function(x) x[-(1:2)])
names(gs) = sapply(ln, function(x) x[1])
gs[1:2]
## $`COVID19-E protein host PPI from Krogan`
## [1] "BRD4"    "BRD2"    "SLC44A2" "ZC3H18"  "AP3B1"   "CWC27"  
## 
## $`COVID19-M protein host PPI from Krogan`
##  [1] "AAR2"     "AASS"     "SLC30A7"  "SLC30A9"  "INTS4"    "SAAL1"   
##  [7] "ANO6"     "ATP1B1"   "YIF1A"    "REEP6"    "GGCX"     "REEP5"   
## [13] "COQ8B"    "TARS2"    "FAM8A1"   "ATP6V1A"  "RTN4"     "TUBGCP2" 
## [19] "TUBGCP3"  "AKAP8L"   "FASTKD5"  "ETFA"     "BZW2"     "PSMD8"   
## [25] "ACADM"    "PITRM1"   "STOM"     "PMPCB"    "PMPCA"    "SLC25A21"

Convert gs to a data frame

df = data.frame(
    geneset = rep(names(gs), times = sapply(gs, length)),
    gene = unlist(gs)
)
head(df)
##                                                                        geneset
## COVID19-E protein host PPI from Krogan1 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan2 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan3 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan4 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan5 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan6 COVID19-E protein host PPI from Krogan
##                                            gene
## COVID19-E protein host PPI from Krogan1    BRD4
## COVID19-E protein host PPI from Krogan2    BRD2
## COVID19-E protein host PPI from Krogan3 SLC44A2
## COVID19-E protein host PPI from Krogan4  ZC3H18
## COVID19-E protein host PPI from Krogan5   AP3B1
## COVID19-E protein host PPI from Krogan6   CWC27

Or use the list_to_dataframe() function in GSEAtraining:

df = list_to_dataframe(gs)
head(df)
##                                 gene_set    gene
## 1 COVID19-E protein host PPI from Krogan    BRD4
## 2 COVID19-E protein host PPI from Krogan    BRD2
## 3 COVID19-E protein host PPI from Krogan SLC44A2
## 4 COVID19-E protein host PPI from Krogan  ZC3H18
## 5 COVID19-E protein host PPI from Krogan   AP3B1
## 6 COVID19-E protein host PPI from Krogan   CWC27