vignettes/topic1_03_msigdbr.Rmd
topic1_03_msigdbr.Rmd
MSigDB defines a simple .gmt
format for storing gene sets. It has the following format:
gene_set_1 gene_set_description gene1 gene2 gene3
gene_set_2 gene_set_description gene4 gene5
...
The .gmt
format is used more and more for sharing gene set data, e.g. https://maayanlab.cloud/Enrichr/#libraries.
Let’s try to read a .gmt
file. We use the .gmt
file for the Hallmark gene set.
download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2023.2.Hs/h.all.v2023.2.Hs.symbols.gmt",
destfile = "h.all.v2023.2.Hs.symbols.gmt")
Since it is not a table, we have to read the file by lines.
ln = readLines("h.all.v2023.2.Hs.symbols.gmt")
ln = strsplit(ln, "\t")
gs = lapply(ln, function(x) x[-(1:2)])
names(gs) = sapply(ln, function(x) x[1])
gs[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "ABCA1" "ABCB8" "ACAA2" "ACADL" "ACADM" "ACADS"
## [7] "ACLY" "ACO2" "ACOX1" "ADCY6" "ADIG" "ADIPOQ"
## [13] "ADIPOR2" "AGPAT3" "AIFM1" "AK2" "ALDH2" "ALDOA"
## [19] "ANGPT1" "ANGPTL4" "APLP2" "APOE" "ARAF" "ARL4A"
## [25] "ATL2" "ATP1B3" "ATP5PO" "BAZ2A" "BCKDHA" "BCL2L13"
## [31] "BCL6" "C3" "CAT" "CAVIN1" "CAVIN2" "CCNG2"
## [37] "CD151" "CD302" "CD36" "CDKN2C" "CHCHD10" "CHUK"
## [43] "CIDEA" "CMBL" "CMPK1" "COL15A1" "COL4A1" "COQ3"
## [49] "COQ5" "COQ9" "COX6A1" "COX7B" "COX8A" "CPT2"
## [55] "CRAT" "CS" "CYC1" "CYP4B1" "DBT" "DDT"
## [61] "DECR1" "DGAT1" "DHCR7" "DHRS7" "DHRS7B" "DLAT"
## [67] "DLD" "DNAJB9" "DNAJC15" "DRAM2" "ECH1" "ECHS1"
## [73] "ELMOD3" "ELOVL6" "ENPP2" "EPHX2" "ESRRA" "ESYT1"
## [79] "ETFB" "FABP4" "FAH" "FZD4" "G3BP2" "GADD45A"
## [85] "GBE1" "GHITM" "GPAM" "GPAT4" "GPD2" "GPHN"
## [91] "GPX3" "GPX4" "GRPEL1" "HADH" "HIBCH" "HSPB8"
## [97] "IDH1" "IDH3A" "IDH3G" "IFNGR1" "IMMT" "ITGA7"
## [103] "ITIH5" "ITSN1" "JAGN1" "LAMA4" "LEP" "LIFR"
## [109] "LIPE" "LPCAT3" "LPL" "LTC4S" "MAP4K3" "MCCC1"
## [115] "MDH2" "ME1" "MGLL" "MGST3" "MIGA2" "MRAP"
## [121] "MRPL15" "MTARC2" "MTCH2" "MYLK" "NABP1" "NDUFA5"
## [127] "NDUFAB1" "NDUFB7" "NDUFS3" "NKIRAS1" "NMT1" "OMD"
## [133] "ORM1" "PDCD4" "PEMT" "PEX14" "PFKFB3" "PFKL"
## [139] "PGM1" "PHLDB1" "PHYH" "PIM3" "PLIN2" "POR"
## [145] "PPARG" "PPM1B" "PPP1R15B" "PRDX3" "PREB" "PTCD3"
## [151] "PTGER3" "QDPR" "RAB34" "REEP5" "REEP6" "RETN"
## [157] "RETSAT" "RIOK3" "RMDN3" "RNF11" "RREB1" "RTN3"
## [163] "SAMM50" "SCARB1" "SCP2" "SDHB" "SDHC" "SLC19A1"
## [169] "SLC1A5" "SLC25A1" "SLC25A10" "SLC27A1" "SLC5A6" "SLC66A3"
## [175] "SNCG" "SOD1" "SORBS1" "SOWAHC" "SPARCL1" "SQOR"
## [181] "SSPN" "STAT5A" "STOM" "SUCLG1" "SULT1A1" "TALDO1"
## [187] "TANK" "TKT" "TOB1" "TST" "UBC" "UBQLN1"
## [193] "UCK1" "UCP2" "UQCR10" "UQCR11" "UQCRC1" "UQCRQ"
## [199] "VEGFB" "YWHAG"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "AARS1" "ABCE1" "ABI1" "ACHE" "ACVR2A" "AKT1"
## [7] "APBB1" "B2M" "BCAT1" "BCL10" "BCL3" "BRCA1"
## [13] "C2" "CAPG" "CARTPT" "CCL11" "CCL13" "CCL19"
## [19] "CCL2" "CCL22" "CCL4" "CCL5" "CCL7" "CCND2"
## [25] "CCND3" "CCR1" "CCR2" "CCR5" "CD1D" "CD2"
## [31] "CD247" "CD28" "CD3D" "CD3E" "CD3G" "CD4"
## [37] "CD40" "CD40LG" "CD47" "CD7" "CD74" "CD79A"
## [43] "CD80" "CD86" "CD8A" "CD8B" "CD96" "CDKN2A"
## [49] "CFP" "CRTAM" "CSF1" "CSK" "CTSS" "CXCL13"
## [55] "CXCL9" "CXCR3" "DARS1" "DEGS1" "DYRK3" "EGFR"
## [61] "EIF3A" "EIF3D" "EIF3J" "EIF4G3" "EIF5A" "ELANE"
## [67] "ELF4" "EREG" "ETS1" "F2" "F2R" "FAS"
## [73] "FASLG" "FCGR2B" "FGR" "FLNA" "FYB1" "GALNT1"
## [79] "GBP2" "GCNT1" "GLMN" "GPR65" "GZMA" "GZMB"
## [85] "HCLS1" "HDAC9" "HIF1A" "HLA-A" "HLA-DMA" "HLA-DMB"
## [91] "HLA-DOA" "HLA-DOB" "HLA-DQA1" "HLA-DRA" "HLA-E" "HLA-G"
## [97] "ICAM1" "ICOSLG" "IFNAR2" "IFNG" "IFNGR1" "IFNGR2"
## [103] "IGSF6" "IKBKB" "IL10" "IL11" "IL12A" "IL12B"
## [109] "IL12RB1" "IL13" "IL15" "IL16" "IL18" "IL18RAP"
## [115] "IL1B" "IL2" "IL27RA" "IL2RA" "IL2RB" "IL2RG"
## [121] "IL4" "IL4R" "IL6" "IL7" "IL9" "INHBA"
## [127] "INHBB" "IRF4" "IRF7" "IRF8" "ITGAL" "ITGB2"
## [133] "ITK" "JAK2" "KLRD1" "KRT1" "LCK" "LCP2"
## [139] "LIF" "LTB" "LY75" "LY86" "LYN" "MAP3K7"
## [145] "MAP4K1" "MBL2" "MMP9" "MRPL3" "MTIF2" "NCF4"
## [151] "NCK1" "NCR1" "NLRP3" "NME1" "NOS2" "NPM1"
## [157] "PF4" "PRF1" "PRKCB" "PRKCG" "PSMB10" "PTPN6"
## [163] "PTPRC" "RARS1" "RIPK2" "RPL39" "RPL3L" "RPL9"
## [169] "RPS19" "RPS3A" "RPS9" "SIT1" "SOCS1" "SOCS5"
## [175] "SPI1" "SRGN" "ST8SIA4" "STAB1" "STAT1" "STAT4"
## [181] "TAP1" "TAP2" "TAPBP" "TGFB1" "TGFB2" "THY1"
## [187] "TIMP1" "TLR1" "TLR2" "TLR3" "TLR6" "TNF"
## [193] "TPD52" "TRAF2" "TRAT1" "UBE2D1" "UBE2N" "WARS1"
## [199] "WAS" "ZAP70"
In rGREAT package, there is a helper function read_gmt()
which can read gene sets from .gmt
file and also does gene ID conversion.
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
## as.data.frame, basename, cbind, colnames, dirname, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
## lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
## tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.3.2
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
##
## findMatches
## The following objects are masked from 'package:base':
##
## I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
##
##
## ========================================
## rGREAT version 2.5.1
## Bioconductor page: http://bioconductor.org/packages/rGREAT/
## Github page: https://github.com/jokergoo/rGREAT
##
## If you use it in published research, please cite:
## Gu, Z. rGREAT: an R/Bioconductor package for functional enrichment
## on genomic regions. Bioinformatics 2023.
##
## This message can be suppressed by:
## suppressPackageStartupMessages(library(rGREAT))
##
## Note: On Aug 19 2019 GREAT released version 4 where it supports `hg38`
## genome and removes some ontologies such pathways. `submitGreatJob()`
## still takes `hg19` as default. `hg38` can be specified by the `species
## = 'hg38'` argument. To use the older versions such as 3.0.0, specify as
## `submitGreatJob(..., version = '3.0.0')`.
##
## From rGREAT version 1.99.0, it also implements the GREAT algorithm and
## it allows to integrate GREAT analysis with the Bioconductor annotation
## ecosystem. By default it supports more than 500 organisms. Check the
## new function `great()` and the new vignette.
## ========================================
gs = read_gmt("h.all.v2023.2.Hs.symbols.gmt")
gs2 = read_gmt("h.all.v2023.2.Hs.symbols.gmt",
from = "SYMBOL", to = "ENSEMBL", orgdb = "org.Hs.eg.db")
gs2[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "ENSG00000165029" "ENSG00000197150" "ENSG00000167315" "ENSG00000115361"
## [5] "ENSG00000117054" "ENSG00000122971" "ENSG00000131473" "ENSG00000100412"
## [9] "ENSG00000161533" "ENSG00000174233" "ENSG00000182035" "ENSG00000181092"
## [13] "ENSG00000160216" "ENSG00000156709" "ENSG00000004455" "ENSG00000111275"
## [17] "ENSG00000149925" "ENSG00000154188" "ENSG00000167772" "ENSG00000084234"
## [21] "ENSG00000130203" "ENSG00000078061" "ENSG00000122644" "ENSG00000119787"
## [25] "ENSG00000069849" "ENSG00000241837" "ENSG00000076108" "ENSG00000248098"
## [29] "ENSG00000099968" "ENSG00000113916" "ENSG00000125730" "ENSG00000121691"
## [33] "ENSG00000177469" "ENSG00000168497" "ENSG00000138764" "ENSG00000177697"
## [37] "ENSG00000241399" "ENSG00000135218" "ENSG00000123080" "ENSG00000213341"
## [41] "ENSG00000176194" "ENSG00000164237" "ENSG00000162368" "ENSG00000204291"
## [45] "ENSG00000187498" "ENSG00000132423" "ENSG00000110871" "ENSG00000088682"
## [49] "ENSG00000111775" "ENSG00000131174" "ENSG00000176340" "ENSG00000157184"
## [53] "ENSG00000095321" "ENSG00000062485" "ENSG00000179091" "ENSG00000142973"
## [57] "ENSG00000137992" "ENSG00000104325" "ENSG00000172893" "ENSG00000100612"
## [61] "ENSG00000109016" "ENSG00000150768" "ENSG00000091140" "ENSG00000128590"
## [65] "ENSG00000120675" "ENSG00000156171" "ENSG00000127884" "ENSG00000115459"
## [69] "ENSG00000170522" "ENSG00000136960" "ENSG00000120915" "ENSG00000173153"
## [73] "ENSG00000139641" "ENSG00000105379" "ENSG00000170323" "ENSG00000103876"
## [77] "ENSG00000174804" "ENSG00000138757" "ENSG00000116717" "ENSG00000114480"
## [81] "ENSG00000165678" "ENSG00000119927" "ENSG00000158669" "ENSG00000115159"
## [85] "ENSG00000171723" "ENSG00000211445" "ENSG00000167468" "ENSG00000109519"
## [89] "ENSG00000138796" "ENSG00000198130" "ENSG00000152137" "ENSG00000138413"
## [93] "ENSG00000166411" "ENSG00000067829" "ENSG00000027697" "ENSG00000132305"
## [97] "ENSG00000135424" "ENSG00000123243" "ENSG00000205726" "ENSG00000171135"
## [101] "ENSG00000112769" "ENSG00000174697" "ENSG00000113594" "ENSG00000079435"
## [105] "ENSG00000111684" "ENSG00000175445" "ENSG00000011566" "ENSG00000078070"
## [109] "ENSG00000146701" "ENSG00000065833" "ENSG00000074416" "ENSG00000143198"
## [113] "ENSG00000148343" "ENSG00000170262" "ENSG00000137547" "ENSG00000117791"
## [117] "ENSG00000065534" "ENSG00000173559" "ENSG00000128609" "ENSG00000004779"
## [121] "ENSG00000099795" "ENSG00000197885" "ENSG00000136448" "ENSG00000127083"
## [125] "ENSG00000229314" "ENSG00000150593" "ENSG00000133027" "ENSG00000142655"
## [129] "ENSG00000170525" "ENSG00000141959" "ENSG00000079739" "ENSG00000019144"
## [133] "ENSG00000107537" "ENSG00000198355" "ENSG00000147872" "ENSG00000127948"
## [137] "ENSG00000132170" "ENSG00000138032" "ENSG00000158615" "ENSG00000165672"
## [141] "ENSG00000138073" "ENSG00000132300" "ENSG00000050628" "ENSG00000151552"
## [145] "ENSG00000109113" "ENSG00000129625" "ENSG00000115255" "ENSG00000104918"
## [149] "ENSG00000042445" "ENSG00000101782" "ENSG00000137824" "ENSG00000123091"
## [153] "ENSG00000124782" "ENSG00000133318" "ENSG00000100347" "ENSG00000073060"
## [157] "ENSG00000116171" "ENSG00000117118" "ENSG00000143252" "ENSG00000173638"
## [161] "ENSG00000105281" "ENSG00000100075" "ENSG00000183048" "ENSG00000130304"
## [165] "ENSG00000138074" "ENSG00000162976" "ENSG00000173267" "ENSG00000142168"
## [169] "ENSG00000095637" "ENSG00000198142" "ENSG00000152583" "ENSG00000137767"
## [173] "ENSG00000123096" "ENSG00000126561" "ENSG00000148175" "ENSG00000163541"
## [177] "ENSG00000177156" "ENSG00000136560" "ENSG00000163931" "ENSG00000141232"
## [181] "ENSG00000128311" "ENSG00000150991" "ENSG00000135018" "ENSG00000130717"
## [185] "ENSG00000175567" "ENSG00000184076" "ENSG00000127540" "ENSG00000010256"
## [189] "ENSG00000164405" "ENSG00000173511" "ENSG00000170027"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "ENSG00000090861" "ENSG00000164163" "ENSG00000136754" "ENSG00000087085"
## [5] "ENSG00000121989" "ENSG00000142208" "ENSG00000166313" "ENSG00000060982"
## [9] "ENSG00000142867" "ENSG00000069399" "ENSG00000012048" "ENSG00000042493"
## [13] "ENSG00000164326" "ENSG00000172156" "ENSG00000181374" "ENSG00000172724"
## [17] "ENSG00000108691" "ENSG00000102962" "ENSG00000108688" "ENSG00000118971"
## [21] "ENSG00000112576" "ENSG00000163823" "ENSG00000121807" "ENSG00000160791"
## [25] "ENSG00000158473" "ENSG00000116824" "ENSG00000198821" "ENSG00000178562"
## [29] "ENSG00000167286" "ENSG00000198851" "ENSG00000160654" "ENSG00000010610"
## [33] "ENSG00000101017" "ENSG00000102245" "ENSG00000196776" "ENSG00000173762"
## [37] "ENSG00000019582" "ENSG00000105369" "ENSG00000121594" "ENSG00000114013"
## [41] "ENSG00000153563" "ENSG00000172116" "ENSG00000153283" "ENSG00000147889"
## [45] "ENSG00000126759" "ENSG00000109943" "ENSG00000184371" "ENSG00000103653"
## [49] "ENSG00000163131" "ENSG00000156234" "ENSG00000138755" "ENSG00000186810"
## [53] "ENSG00000115866" "ENSG00000143753" "ENSG00000143479" "ENSG00000146648"
## [57] "ENSG00000170099" "ENSG00000100353" "ENSG00000104131" "ENSG00000075151"
## [61] "ENSG00000204472" "ENSG00000102034" "ENSG00000124882" "ENSG00000134954"
## [65] "ENSG00000180210" "ENSG00000181104" "ENSG00000026103" "ENSG00000117560"
## [69] "ENSG00000072694" "ENSG00000000938" "ENSG00000196924" "ENSG00000082074"
## [73] "ENSG00000141429" "ENSG00000162645" "ENSG00000187210" "ENSG00000174842"
## [77] "ENSG00000140030" "ENSG00000145649" "ENSG00000100453" "ENSG00000180353"
## [81] "ENSG00000048052" "ENSG00000100644" "ENSG00000090339" "ENSG00000160223"
## [85] "ENSG00000159110" "ENSG00000111537" "ENSG00000027697" "ENSG00000104365"
## [89] "ENSG00000136634" "ENSG00000095752" "ENSG00000168811" "ENSG00000113302"
## [93] "ENSG00000096996" "ENSG00000169194" "ENSG00000164136" "ENSG00000172349"
## [97] "ENSG00000150782" "ENSG00000115607" "ENSG00000125538" "ENSG00000109471"
## [101] "ENSG00000134460" "ENSG00000100385" "ENSG00000147168" "ENSG00000113520"
## [105] "ENSG00000077238" "ENSG00000136244" "ENSG00000104432" "ENSG00000145839"
## [109] "ENSG00000122641" "ENSG00000163083" "ENSG00000137265" "ENSG00000140968"
## [113] "ENSG00000005844" "ENSG00000160255" "ENSG00000113263" "ENSG00000096968"
## [117] "ENSG00000134539" "ENSG00000167768" "ENSG00000182866" "ENSG00000043462"
## [121] "ENSG00000128342" "ENSG00000054219" "ENSG00000112799" "ENSG00000254087"
## [125] "ENSG00000135341" "ENSG00000165471" "ENSG00000100985" "ENSG00000114686"
## [129] "ENSG00000085760" "ENSG00000158092" "ENSG00000162711" "ENSG00000239672"
## [133] "ENSG00000007171" "ENSG00000181163" "ENSG00000163737" "ENSG00000180644"
## [137] "ENSG00000166501" "ENSG00000126583" "ENSG00000205220" "ENSG00000111679"
## [141] "ENSG00000113643" "ENSG00000104312" "ENSG00000198918" "ENSG00000140986"
## [145] "ENSG00000163682" "ENSG00000105372" "ENSG00000145425" "ENSG00000137078"
## [149] "ENSG00000185338" "ENSG00000171150" "ENSG00000066336" "ENSG00000122862"
## [153] "ENSG00000113532" "ENSG00000010327" "ENSG00000115415" "ENSG00000138378"
## [157] "ENSG00000105329" "ENSG00000092969" "ENSG00000154096" "ENSG00000102265"
## [161] "ENSG00000174125" "ENSG00000137462" "ENSG00000164342" "ENSG00000174130"
## [165] "ENSG00000076554" "ENSG00000127191" "ENSG00000163519" "ENSG00000072401"
## [169] "ENSG00000177889" "ENSG00000140105" "ENSG00000015285" "ENSG00000115085"
There is already an R package msigdbr which has integerated all MSigDB gene sets.
The major advantage is, MSigDB only supports human and mouse, msigdbr supports more organisms by mapping to the orthologe genes.
## # A tibble: 20 × 2
## species_name species_common_name
## <chr> <chr>
## 1 Anolis carolinensis Carolina anole, green anole
## 2 Bos taurus bovine, cattle, cow, dairy cow, domestic cat…
## 3 Caenorhabditis elegans NA
## 4 Canis lupus familiaris dog, dogs
## 5 Danio rerio leopard danio, zebra danio, zebra fish, zebr…
## 6 Drosophila melanogaster fruit fly
## 7 Equus caballus domestic horse, equine, horse
## 8 Felis catus cat, cats, domestic cat
## 9 Gallus gallus bantam, chicken, chickens, Gallus domesticus
## 10 Homo sapiens human
## 11 Macaca mulatta rhesus macaque, rhesus macaques, Rhesus monk…
## 12 Monodelphis domestica gray short-tailed opossum
## 13 Mus musculus house mouse, mouse
## 14 Ornithorhynchus anatinus duck-billed platypus, duckbill platypus, pla…
## 15 Pan troglodytes chimpanzee
## 16 Rattus norvegicus brown rat, Norway rat, rat, rats
## 17 Saccharomyces cerevisiae baker's yeast, brewer's yeast, S. cerevisiae
## 18 Schizosaccharomyces pombe 972h- NA
## 19 Sus scrofa pig, pigs, swine, wild boar
## 20 Xenopus tropicalis tropical clawed frog, western clawed frog
All categories of gene sets:
## gs_cat gs_subcat num_genesets
## 1 C1 299
## 2 C2 CGP 3384
## 3 C2 CP 29
## 4 C2 CP:BIOCARTA 292
## 5 C2 CP:KEGG 186
## 6 C2 CP:PID 196
## 7 C2 CP:REACTOME 1615
## 8 C2 CP:WIKIPATHWAYS 664
## 9 C3 MIR:MIRDB 2377
## 10 C3 MIR:MIR_Legacy 221
## 11 C3 TFT:GTRD 518
## 12 C3 TFT:TFT_Legacy 610
## 13 C4 CGN 427
## 14 C4 CM 431
## 15 C5 GO:BP 7658
## 16 C5 GO:CC 1006
## 17 C5 GO:MF 1738
## 18 C5 HPO 5071
## 19 C6 189
## 20 C7 IMMUNESIGDB 4872
## 21 C7 VAX 347
## 22 C8 700
## 23 H 50
Values in the gs_cat
column and the gs_subcat
column can be used to retrieve gene sets in a specific category. E.g., we want to extract genesets in C2 category and CP:KEGG sub-category:
gene_sets = msigdbr(category = "C2", subcategory = "CP:KEGG")
gene_sets
## # A tibble: 16,283 × 15
## gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
## <chr> <chr> <chr> <chr> <int> <chr>
## 1 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA1 19 ENSG000001650…
## 2 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA10 10349 ENSG000001542…
## 3 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA12 26154 ENSG000001444…
## 4 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA13 154664 ENSG000001798…
## 5 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA2 20 ENSG000001073…
## 6 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA3 21 ENSG000001679…
## 7 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA4 24 ENSG000001986…
## 8 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA5 23461 ENSG000001542…
## 9 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA6 23460 ENSG000001542…
## 10 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA7 10347 ENSG000000646…
## # ℹ 16,273 more rows
## # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
## # human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
## # gs_exact_source <chr>, gs_url <chr>, gs_description <chr>
If you don’t like the tibble
format, you can change it to a data frame.
gene_sets = as.data.frame(gene_sets)
head(gene_sets)
## gs_cat gs_subcat gs_name gene_symbol entrez_gene
## 1 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA1 19
## 2 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA10 10349
## 3 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA12 26154
## 4 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA13 154664
## 5 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA2 20
## 6 C2 CP:KEGG KEGG_ABC_TRANSPORTERS ABCA3 21
## ensembl_gene human_gene_symbol human_entrez_gene human_ensembl_gene gs_id
## 1 ENSG00000165029 ABCA1 19 ENSG00000165029 M11911
## 2 ENSG00000154263 ABCA10 10349 ENSG00000154263 M11911
## 3 ENSG00000144452 ABCA12 26154 ENSG00000144452 M11911
## 4 ENSG00000179869 ABCA13 154664 ENSG00000179869 M11911
## 5 ENSG00000107331 ABCA2 20 ENSG00000107331 M11911
## 6 ENSG00000167972 ABCA3 21 ENSG00000167972 M11911
## gs_pmid gs_geoid gs_exact_source
## 1 hsa02010
## 2 hsa02010
## 3 hsa02010
## 4 hsa02010
## 5 hsa02010
## 6 hsa02010
## gs_url gs_description
## 1 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 2 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 3 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 4 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 5 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
## 6 http://www.genome.jp/kegg/pathway/hsa/hsa02010.html ABC transporters
Danger zone: The
entrez_gene
column is saved in integer. This may cause unseeable problems. Convert it explicitely to character:gene_set$entrez_gene = as.character(gene_set$entrez_gene)
.
Another advantage of misgdbr is that it provides all three major gene ID types. For example, if you need Ensembl gene IDs, you can obtain the gene sets by:
## gs_name ensembl_gene
## 1 KEGG_ABC_TRANSPORTERS ENSG00000165029
## 2 KEGG_ABC_TRANSPORTERS ENSG00000154263
## 3 KEGG_ABC_TRANSPORTERS ENSG00000144452
## 4 KEGG_ABC_TRANSPORTERS ENSG00000179869
## 5 KEGG_ABC_TRANSPORTERS ENSG00000107331
## 6 KEGG_ABC_TRANSPORTERS ENSG00000167972
By default it retrieves data for human. The species
argument can be used for other supported species.
gene_sets = msigdbr(species = "dog", category = "C2", subcategory = "CP:BIOCARTA")
gene_sets = as.data.frame(gene_sets)
head(gene_sets)
## gs_cat gs_subcat gs_name gene_symbol entrez_gene
## 1 C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY ATF2 478806
## 2 C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY CHUK 477796
## 3 C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY IFNG 403801
## 4 C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY IKBKB 482839
## 5 C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY IL2 403989
## 6 C2 CP:BIOCARTA BIOCARTA_41BB_PATHWAY IL4 403785
## ensembl_gene human_gene_symbol human_entrez_gene human_ensembl_gene
## 1 ENSCAFG00845017708 ATF2 1386 ENSG00000115966
## 2 ENSCAFG00845016257 CHUK 1147 ENSG00000213341
## 3 ENSCAFG00845013680 IFNG 3458 ENSG00000111537
## 4 ENSCAFG00845006197 IKBKB 3551 ENSG00000104365
## 5 ENSCAFG00845020168 IL2 3558 ENSG00000109471
## 6 ENSCAFG00845010991 IL4 3565 ENSG00000113520
## gs_id gs_pmid gs_geoid gs_exact_source
## 1 M2064
## 2 M2064
## 3 M2064
## 4 M2064
## 5 M2064
## 6 M2064
## gs_url
## 1 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 2 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 3 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 4 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 5 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## 6 https://data.broadinstitute.org/gsea-msigdb/msigdb/biocarta/human/h_41BBPathway.gif
## gs_description taxon_id
## 1 The 4-1BB-dependent immune response 9615
## 2 The 4-1BB-dependent immune response 9615
## 3 The 4-1BB-dependent immune response 9615
## 4 The 4-1BB-dependent immune response 9615
## 5 The 4-1BB-dependent immune response 9615
## 6 The 4-1BB-dependent immune response 9615
## ortholog_sources num_ortholog_sources
## 1 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB 5
## 2 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB 5
## 3 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB 5
## 4 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB 5
## 5 Ensembl|HomoloGene|Inparanoid|NCBI 4
## 6 Ensembl|HomoloGene|Inparanoid|NCBI|OrthoDB 5
The table contains mappings between gene sets and all the three gene ID types. Since Symbol-EntreZ-Ensembl mapping is not 1:1, it is possible rows are duplicated if only taking two columns from the complete table:
gene_sets = msigdbr(category = "C5", subcategory = "GO:BP")
tb = gene_sets[, c("gs_name", "gene_symbol")]
dim(tb)
## [1] 721379 2
## [1] 633008 2
So, additionally applying unique()
is always safe.
We also have several functions in GSEAtraining to retrieve gene sets from MSigDB. It directly downloads data from MSigDB web server.
## [1] "6.0" "6.1" "6.2" "7.0" "7.1" "7.2"
## [7] "7.3" "7.4" "7.5.1" "7.5" "2022.1.Hs" "2022.1.Mm"
## [13] "2023.1.Hs" "2023.1.Mm" "2023.2.Hs" "2023.2.Mm"
list_msigdb_collections("2023.2.Hs")
## [1] "c1.all" "c2.all" "c2.cgp"
## [4] "c2.cp.biocarta" "c2.cp.kegg_legacy" "c2.cp.kegg_medicus"
## [7] "c2.cp.pid" "c2.cp.reactome" "c2.cp"
## [10] "c2.cp.wikipathways" "c3.all" "c3.mir.mir_legacy"
## [13] "c3.mir.mirdb" "c3.mir" "c3.tft.gtrd"
## [16] "c3.tft.tft_legacy" "c3.tft" "c4.3ca"
## [19] "c4.all" "c4.cgn" "c4.cm"
## [22] "c5.all" "c5.go.bp" "c5.go.cc"
## [25] "c5.go.mf" "c5.go" "c5.hpo"
## [28] "c6.all" "c7.all" "c7.immunesigdb"
## [31] "c7.vax" "c8.all" "h.all"
lt = get_msigdb(version = "2023.2.Hs", collection = "h.all")
lt[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "19" "11194" "10449" "33" "34" "35" "47" "50"
## [9] "51" "112" "149685" "9370" "79602" "56894" "9131" "204"
## [17] "217" "226" "284" "51129" "334" "348" "369" "10124"
## [25] "64225" "483" "539" "11176" "593" "23786" "604" "718"
## [33] "847" "284119" "8436" "901" "977" "9936" "948" "1031"
## [41] "400916" "1147" "1149" "134147" "51727" "1306" "1282" "51805"
## [49] "84274" "57017" "1337" "1349" "1351" "1376" "1384" "1431"
## [57] "1537" "1580" "1629" "1652" "1666" "8694" "1717" "51635"
## [65] "25979" "1737" "1738" "4189" "29103" "128338" "1891" "1892"
## [73] "84173" "79071" "5168" "2053" "2101" "23344" "2109" "2167"
## [81] "2184" "8322" "9908" "1647" "2632" "27069" "57678" "137964"
## [89] "2820" "10243" "2878" "2879" "80273" "3033" "26275" "26353"
## [97] "3417" "3419" "3421" "3459" "10989" "3679" "80760" "6453"
## [105] "84522" "3910" "3952" "3977" "3991" "10162" "4023" "4056"
## [113] "8491" "56922" "4191" "4199" "11343" "4259" "84895" "56246"
## [121] "29088" "54996" "23788" "4638" "64859" "4698" "4706" "4713"
## [129] "4722" "28512" "4836" "4958" "5004" "27250" "10400" "5195"
## [137] "5209" "5211" "5236" "23187" "5264" "415116" "123" "5447"
## [145] "5468" "5495" "84919" "10935" "10113" "55037" "5733" "5860"
## [153] "83871" "7905" "92840" "56729" "54884" "8780" "55177" "26994"
## [161] "6239" "10313" "25813" "949" "6342" "6390" "6391" "6573"
## [169] "6510" "6576" "1468" "376497" "8884" "130814" "6623" "6647"
## [177] "10580" "65124" "8404" "58472" "8082" "6776" "2040" "8802"
## [185] "6817" "6888" "10010" "7086" "10140" "7263" "7316" "29979"
## [193] "83549" "7351" "29796" "10975" "7384" "27089" "7423" "7532"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "16" "6059" "10006" "43" "92" "207" "322" "567"
## [9] "586" "8915" "602" "672" "717" "822" "9607" "6356"
## [17] "6357" "6363" "6347" "6367" "6351" "6352" "6354" "894"
## [25] "896" "1230" "729230" "1234" "912" "914" "919" "940"
## [33] "915" "916" "917" "920" "958" "959" "961" "924"
## [41] "972" "973" "941" "942" "925" "926" "10225" "1029"
## [49] "5199" "56253" "1435" "1445" "1520" "10563" "4283" "2833"
## [57] "1615" "8560" "8444" "1956" "8661" "8664" "8669" "8672"
## [65] "1984" "1991" "2000" "2069" "2113" "2147" "2149" "355"
## [73] "356" "2213" "2268" "2316" "2533" "2589" "2634" "2650"
## [81] "11146" "8477" "3001" "3002" "3059" "9734" "3091" "3105"
## [89] "3108" "3109" "3111" "3112" "3117" "3122" "3133" "3135"
## [97] "3383" "23308" "3455" "3458" "3459" "3460" "10261" "3551"
## [105] "3586" "3589" "3592" "3593" "3594" "3596" "3600" "3603"
## [113] "3606" "8807" "3553" "3558" "9466" "3559" "3560" "3561"
## [121] "3565" "3566" "3569" "3574" "3578" "3624" "3625" "3662"
## [129] "3665" "3394" "3683" "3689" "3702" "3717" "3824" "3848"
## [137] "3932" "3937" "3976" "4050" "4065" "9450" "4067" "6885"
## [145] "11184" "4153" "4318" "11222" "4528" "4689" "4690" "9437"
## [153] "114548" "4830" "4843" "4869" "5196" "5551" "5579" "5582"
## [161] "5699" "5777" "5788" "5917" "8767" "6170" "6123" "6133"
## [169] "6223" "6189" "6203" "27240" "8651" "9655" "6688" "5552"
## [177] "7903" "23166" "6772" "6775" "6890" "6891" "6892" "7040"
## [185] "7042" "7070" "7076" "7096" "7097" "7098" "10333" "7124"
## [193] "7163" "7186" "50852" "7321" "7334" "7453" "7454" "7535"
The gene set resource on https://maayanlab.cloud/Enrichr/#libraries is very useful. You may want to use it some say in the future. Take one gene set collection (e.g. the COVID-19 related gene sets), download the corresponding gene set file (in .gmt
format) and try to read into R as a list or a two-column data frame.
download.file("https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=COVID-19_Related_Gene_Sets", destfile = "covid-19.gmt")
ln = readLines("covid-19.gmt")
ln = strsplit(ln, "\t")
gs = lapply(ln, function(x) x[-(1:2)])
names(gs) = sapply(ln, function(x) x[1])
gs[1:2]
## $`COVID19-E protein host PPI from Krogan`
## [1] "BRD4" "BRD2" "SLC44A2" "ZC3H18" "AP3B1" "CWC27"
##
## $`COVID19-M protein host PPI from Krogan`
## [1] "AAR2" "AASS" "SLC30A7" "SLC30A9" "INTS4" "SAAL1"
## [7] "ANO6" "ATP1B1" "YIF1A" "REEP6" "GGCX" "REEP5"
## [13] "COQ8B" "TARS2" "FAM8A1" "ATP6V1A" "RTN4" "TUBGCP2"
## [19] "TUBGCP3" "AKAP8L" "FASTKD5" "ETFA" "BZW2" "PSMD8"
## [25] "ACADM" "PITRM1" "STOM" "PMPCB" "PMPCA" "SLC25A21"
Convert gs
to a data frame
## geneset
## COVID19-E protein host PPI from Krogan1 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan2 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan3 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan4 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan5 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan6 COVID19-E protein host PPI from Krogan
## gene
## COVID19-E protein host PPI from Krogan1 BRD4
## COVID19-E protein host PPI from Krogan2 BRD2
## COVID19-E protein host PPI from Krogan3 SLC44A2
## COVID19-E protein host PPI from Krogan4 ZC3H18
## COVID19-E protein host PPI from Krogan5 AP3B1
## COVID19-E protein host PPI from Krogan6 CWC27
Or use the list_to_dataframe()
function in GSEAtraining:
df = list_to_dataframe(gs)
head(df)
## gene_set gene
## 1 COVID19-E protein host PPI from Krogan BRD4
## 2 COVID19-E protein host PPI from Krogan BRD2
## 3 COVID19-E protein host PPI from Krogan SLC44A2
## 4 COVID19-E protein host PPI from Krogan ZC3H18
## 5 COVID19-E protein host PPI from Krogan AP3B1
## 6 COVID19-E protein host PPI from Krogan CWC27