Topic 1-03: MSigDB gene sets
Zuguang Gu z.gu@dkfz.de
2025-05-31
Source: vignettes/topic1_03_msigdb.Rmd
The .gmt data format
MSigDB defines a simple .gmt format for storing gene
sets. Each line describes one gene set as tab-separated fields: the gene set name, a description, and then the member genes:
gene_set_1 gene_set_description gene1 gene2 gene3
gene_set_2 gene_set_description gene4 gene5
...
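As a minimal sketch of what such a file looks like on disk, the following writes two made-up gene sets to a temporary file, one line per gene set with tab-separated fields. The object names (gs_demo, desc_demo, gmt_file) and the gene names are placeholders for illustration only.

```r
# Two made-up gene sets; names, descriptions and genes are illustrative only
gs_demo = list(
    gene_set_1 = c("gene1", "gene2", "gene3"),
    gene_set_2 = c("gene4", "gene5")
)
desc_demo = c(gene_set_1 = "gene_set_description",
              gene_set_2 = "gene_set_description")

# One line per gene set: name, description, then the genes, joined by tabs
gmt_lines = sapply(names(gs_demo), function(nm) {
    paste(c(nm, desc_demo[[nm]], gs_demo[[nm]]), collapse = "\t")
})

gmt_file = tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Number of tab-separated fields per line; it differs between gene sets
n_fields = lengths(strsplit(readLines(gmt_file), "\t"))
n_fields
```

Because the number of fields varies from line to line, such a file cannot be treated as a regular rectangular table.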
The .gmt format is increasingly used for sharing gene
set data, e.g. at https://maayanlab.cloud/Enrichr/#libraries.
Let’s try to read a .gmt file. We use the
.gmt file for the Hallmark gene set collection.
download.file("https://data.broadinstitute.org/gsea-msigdb/msigdb/release/2024.1.Hs/h.all.v2024.1.Hs.symbols.gmt",
    destfile = "h.all.v2024.1.Hs.symbols.gmt")
Since the lines have different numbers of fields, the file is not a regular table, so we read it line by line.
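ln = readLines("h.all.v2024.1.Hs.symbols.gmt")
# fields on each line are separated by tabs
ln = strsplit(ln, "\t")
# drop the first two fields (name and description) and keep the genes
gs = lapply(ln, function(x) x[-(1:2)])
# the first field is the gene set name
gs_names = sapply(ln, function(x) x[1])
names(gs) = gs_names
gs[1:2]
## $HALLMARK_ADIPOGENESIS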
## [1] "ABCA1" "ABCB8" "ACAA2" "ACADL" "ACADM" "ACADS"
## [7] "ACLY" "ACO2" "ACOX1" "ADCY6" "ADIG" "ADIPOQ"
## [13] "ADIPOR2" "AGPAT3" "AIFM1" "AK2" "ALDH2" "ALDOA"
## [19] "ANGPT1" "ANGPTL4" "APLP2" "APOE" "ARAF" "ARL4A"
## [25] "ATL2" "ATP1B3" "ATP5PO" "BAZ2A" "BCKDHA" "BCL2L13"
## [31] "BCL6" "C3" "CAT" "CAVIN1" "CAVIN2" "CCNG2"
## [37] "CD151" "CD302" "CD36" "CDKN2C" "CHCHD10" "CHUK"
## [43] "CIDEA" "CMBL" "CMPK1" "COL15A1" "COL4A1" "COQ3"
## [49] "COQ5" "COQ9" "COX6A1" "COX7B" "COX8A" "CPT2"
## [55] "CRAT" "CS" "CYC1" "CYP4B1" "DBT" "DDT"
## [61] "DECR1" "DGAT1" "DHCR7" "DHRS7" "DHRS7B" "DLAT"
## [67] "DLD" "DNAJB9" "DNAJC15" "DRAM2" "ECH1" "ECHS1"
## [73] "ELMOD3" "ELOVL6" "ENPP2" "EPHX2" "ESRRA" "ESYT1"
## [79] "ETFB" "FABP4" "FAH" "FZD4" "G3BP2" "GADD45A"
## [85] "GBE1" "GHITM" "GPAM" "GPAT4" "GPD2" "GPHN"
## [91] "GPX3" "GPX4" "GRPEL1" "HADH" "HIBCH" "HSPB8"
## [97] "IDH1" "IDH3A" "IDH3G" "IFNGR1" "IMMT" "ITGA7"
## [103] "ITIH5" "ITSN1" "JAGN1" "LAMA4" "LEP" "LIFR"
## [109] "LIPE" "LPCAT3" "LPL" "LTC4S" "MAP4K3" "MCCC1"
## [115] "MDH2" "ME1" "MGLL" "MGST3" "MIGA2" "MRAP"
## [121] "MRPL15" "MTARC2" "MTCH2" "MYLK" "NABP1" "NDUFA5"
## [127] "NDUFAB1" "NDUFB7" "NDUFS3" "NKIRAS1" "NMT1" "OMD"
## [133] "ORM1" "PDCD4" "PEMT" "PEX14" "PFKFB3" "PFKL"
## [139] "PGM1" "PHLDB1" "PHYH" "PIM3" "PLIN2" "POR"
## [145] "PPARG" "PPM1B" "PPP1R15B" "PRDX3" "PREB" "PTCD3"
## [151] "PTGER3" "QDPR" "RAB34" "REEP5" "REEP6" "RETN"
## [157] "RETSAT" "RIOK3" "RMDN3" "RNF11" "RREB1" "RTN3"
## [163] "SAMM50" "SCARB1" "SCP2" "SDHB" "SDHC" "SLC19A1"
## [169] "SLC1A5" "SLC25A1" "SLC25A10" "SLC27A1" "SLC5A6" "SLC66A3"
## [175] "SNCG" "SOD1" "SORBS1" "SOWAHC" "SPARCL1" "SQOR"
## [181] "SSPN" "STAT5A" "STOM" "SUCLG1" "SULT1A1" "TALDO1"
## [187] "TANK" "TKT" "TOB1" "TST" "UBC" "UBQLN1"
## [193] "UCK1" "UCP2" "UQCR10" "UQCR11" "UQCRC1" "UQCRQ"
## [199] "VEGFB" "YWHAG"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "AARS1" "ABCE1" "ABI1" "ACHE" "ACVR2A" "AKT1"
## [7] "APBB1" "B2M" "BCAT1" "BCL10" "BCL3" "BRCA1"
## [13] "C2" "CAPG" "CARTPT" "CCL11" "CCL13" "CCL19"
## [19] "CCL2" "CCL22" "CCL4" "CCL5" "CCL7" "CCND2"
## [25] "CCND3" "CCR1" "CCR2" "CCR5" "CD1D" "CD2"
## [31] "CD247" "CD28" "CD3D" "CD3E" "CD3G" "CD4"
## [37] "CD40" "CD40LG" "CD47" "CD7" "CD74" "CD79A"
## [43] "CD80" "CD86" "CD8A" "CD8B" "CD96" "CDKN2A"
## [49] "CFP" "CRTAM" "CSF1" "CSK" "CTSS" "CXCL13"
## [55] "CXCL9" "CXCR3" "DARS1" "DEGS1" "DYRK3" "EGFR"
## [61] "EIF3A" "EIF3D" "EIF3J" "EIF4G3" "EIF5A" "ELANE"
## [67] "ELF4" "EREG" "ETS1" "F2" "F2R" "FAS"
## [73] "FASLG" "FCGR2B" "FGR" "FLNA" "FYB1" "GALNT1"
## [79] "GBP2" "GCNT1" "GLMN" "GPR65" "GZMA" "GZMB"
## [85] "HCLS1" "HDAC9" "HIF1A" "HLA-A" "HLA-DMA" "HLA-DMB"
## [91] "HLA-DOA" "HLA-DOB" "HLA-DQA1" "HLA-DRA" "HLA-E" "HLA-G"
## [97] "ICAM1" "ICOSLG" "IFNAR2" "IFNG" "IFNGR1" "IFNGR2"
## [103] "IGSF6" "IKBKB" "IL10" "IL11" "IL12A" "IL12B"
## [109] "IL12RB1" "IL13" "IL15" "IL16" "IL18" "IL18RAP"
## [115] "IL1B" "IL2" "IL27RA" "IL2RA" "IL2RB" "IL2RG"
## [121] "IL4" "IL4R" "IL6" "IL7" "IL9" "INHBA"
## [127] "INHBB" "IRF4" "IRF7" "IRF8" "ITGAL" "ITGB2"
## [133] "ITK" "JAK2" "KLRD1" "KRT1" "LCK" "LCP2"
## [139] "LIF" "LTB" "LY75" "LY86" "LYN" "MAP3K7"
## [145] "MAP4K1" "MBL2" "MMP9" "MRPL3" "MTIF2" "NCF4"
## [151] "NCK1" "NCR1" "NLRP3" "NME1" "NOS2" "NPM1"
## [157] "PF4" "PRF1" "PRKCB" "PRKCG" "PSMB10" "PTPN6"
## [163] "PTPRC" "RARS1" "RIPK2" "RPL39" "RPL3L" "RPL9"
## [169] "RPS19" "RPS3A" "RPS9" "SIT1" "SOCS1" "SOCS5"
## [175] "SPI1" "SRGN" "ST8SIA4" "STAB1" "STAT1" "STAT4"
## [181] "TAP1" "TAP2" "TAPBP" "TGFB1" "TGFB2" "THY1"
## [187] "TIMP1" "TLR1" "TLR2" "TLR3" "TLR6" "TNF"
## [193] "TPD52" "TRAF2" "TRAT1" "UBE2D1" "UBE2N" "WARS1"
## [199] "WAS" "ZAP70"
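The second field of each line holds the gene set description, which here is a URL to the gene set page on the MSigDB website:
## HALLMARK_ADIPOGENESIS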
## "https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_ADIPOGENESIS"
## HALLMARK_ALLOGRAFT_REJECTION
## "https://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_ALLOGRAFT_REJECTION"
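The steps above (split each line by tabs, take the first field as the name, the second as the description and the rest as genes) can be wrapped in a small helper. This is only a sketch: parse_gmt_lines() is a hypothetical function, not part of GSEAtopics, and the toy input mimics two lines of the Hallmark file with the gene lists shortened.

```r
# Hypothetical helper (not part of any package): parse the lines of a .gmt file
parse_gmt_lines = function(lines) {
    fields = strsplit(lines, "\t")
    gs_names = vapply(fields, function(x) x[1], character(1))

    # genes: everything after the first two fields
    gene_sets = lapply(fields, function(x) x[-(1:2)])
    names(gene_sets) = gs_names

    # descriptions: the second field
    descriptions = vapply(fields, function(x) x[2], character(1))
    names(descriptions) = gs_names

    list(gene_sets = gene_sets, descriptions = descriptions)
}

# Toy input; the gene lists are shortened for illustration
toy = c(
    "HALLMARK_ADIPOGENESIS\thttps://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_ADIPOGENESIS\tABCA1\tABCB8\tACAA2",
    "HALLMARK_ALLOGRAFT_REJECTION\thttps://www.gsea-msigdb.org/gsea/msigdb/human/geneset/HALLMARK_ALLOGRAFT_REJECTION\tAARS1\tABCE1"
)

res = parse_gmt_lines(toy)
res$descriptions
lengths(res$gene_sets)
```

For a real file, the same helper could be called as parse_gmt_lines(readLines("h.all.v2024.1.Hs.symbols.gmt")).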
Read gene sets from MSigDB
All MSigDB gene sets are hosted at https://data.broadinstitute.org/gsea-msigdb/msigdb/release/. The GSEAtopics package provides several helper functions for retrieving the gene sets of a specific collection. The available MSigDB versions are:
## [1] "6.0" "6.1" "6.2" "7.0" "7.1" "7.2"
## [7] "7.3" "7.4" "7.5.1" "7.5" "2022.1.Hs" "2022.1.Mm"
## [13] "2023.1.Hs" "2023.1.Mm" "2023.2.Hs" "2023.2.Mm" "2024.1.Hs" "2024.1.Mm"
list_msigdb_collections("2024.1.Hs")
## [1] "c1.all" "c2.all" "c2.cgp"
## [4] "c2.cp.biocarta" "c2.cp.kegg_legacy" "c2.cp.kegg_medicus"
## [7] "c2.cp.pid" "c2.cp.reactome" "c2.cp"
## [10] "c2.cp.wikipathways" "c3.all" "c3.mir.mir_legacy"
## [13] "c3.mir.mirdb" "c3.mir" "c3.tft.gtrd"
## [16] "c3.tft.tft_legacy" "c3.tft" "c4.3ca"
## [19] "c4.all" "c4.cgn" "c4.cm"
## [22] "c5.all" "c5.go.bp" "c5.go.cc"
## [25] "c5.go.mf" "c5.go" "c5.hpo"
## [28] "c6.all" "c7.all" "c7.immunesigdb"
## [31] "c7.vax" "c8.all" "h.all"
lt = get_msigdb(version = "2024.1.Hs", collection = "h.all")
lt[1:2]
## $HALLMARK_ADIPOGENESIS
## [1] "19" "11194" "10449" "33" "34" "35" "47" "50"
## [9] "51" "112" "149685" "9370" "79602" "56894" "9131" "204"
## [17] "217" "226" "284" "51129" "334" "348" "369" "10124"
## [25] "64225" "483" "539" "11176" "593" "23786" "604" "718"
## [33] "847" "284119" "8436" "901" "977" "9936" "948" "1031"
## [41] "400916" "1147" "1149" "134147" "51727" "1306" "1282" "51805"
## [49] "84274" "57017" "1337" "1349" "1351" "1376" "1384" "1431"
## [57] "1537" "1580" "1629" "1652" "1666" "8694" "1717" "51635"
## [65] "25979" "1737" "1738" "4189" "29103" "128338" "1891" "1892"
## [73] "84173" "79071" "5168" "2053" "2101" "23344" "2109" "2167"
## [81] "2184" "8322" "9908" "1647" "2632" "27069" "57678" "137964"
## [89] "2820" "10243" "2878" "2879" "80273" "3033" "26275" "26353"
## [97] "3417" "3419" "3421" "3459" "10989" "3679" "80760" "6453"
## [105] "84522" "3910" "3952" "3977" "3991" "10162" "4023" "4056"
## [113] "8491" "56922" "4191" "4199" "11343" "4259" "84895" "56246"
## [121] "29088" "54996" "23788" "4638" "64859" "4698" "4706" "4713"
## [129] "4722" "28512" "4836" "4958" "5004" "27250" "10400" "5195"
## [137] "5209" "5211" "5236" "23187" "5264" "415116" "123" "5447"
## [145] "5468" "5495" "84919" "10935" "10113" "55037" "5733" "5860"
## [153] "83871" "7905" "92840" "56729" "54884" "8780" "55177" "26994"
## [161] "6239" "10313" "25813" "949" "6342" "6390" "6391" "6573"
## [169] "6510" "6576" "1468" "376497" "8884" "130814" "6623" "6647"
## [177] "10580" "65124" "8404" "58472" "8082" "6776" "2040" "8802"
## [185] "6817" "6888" "10010" "7086" "10140" "7263" "7316" "29979"
## [193] "83549" "7351" "29796" "10975" "7384" "27089" "7423" "7532"
##
## $HALLMARK_ALLOGRAFT_REJECTION
## [1] "16" "6059" "10006" "43" "92" "207" "322" "567"
## [9] "586" "8915" "602" "672" "717" "822" "9607" "6356"
## [17] "6357" "6363" "6347" "6367" "6351" "6352" "6354" "894"
## [25] "896" "1230" "729230" "1234" "912" "914" "919" "940"
## [33] "915" "916" "917" "920" "958" "959" "961" "924"
## [41] "972" "973" "941" "942" "925" "926" "10225" "1029"
## [49] "5199" "56253" "1435" "1445" "1520" "10563" "4283" "2833"
## [57] "1615" "8560" "8444" "1956" "8661" "8664" "8669" "8672"
## [65] "1984" "1991" "2000" "2069" "2113" "2147" "2149" "355"
## [73] "356" "2213" "2268" "2316" "2533" "2589" "2634" "2650"
## [81] "11146" "8477" "3001" "3002" "3059" "9734" "3091" "3105"
## [89] "3108" "3109" "3111" "3112" "3117" "3122" "3133" "3135"
## [97] "3383" "23308" "3455" "3458" "3459" "3460" "10261" "3551"
## [105] "3586" "3589" "3592" "3593" "3594" "3596" "3600" "3603"
## [113] "3606" "8807" "3553" "3558" "9466" "3559" "3560" "3561"
## [121] "3565" "3566" "3569" "3574" "3578" "3624" "3625" "3662"
## [129] "3665" "3394" "3683" "3689" "3702" "3717" "3824" "3848"
## [137] "3932" "3937" "3976" "4050" "4065" "9450" "4067" "6885"
## [145] "11184" "4153" "4318" "11222" "4528" "4689" "4690" "9437"
## [153] "114548" "4830" "4843" "4869" "5196" "5551" "5579" "5582"
## [161] "5699" "5777" "5788" "5917" "8767" "6170" "6123" "6133"
## [169] "6223" "6189" "6203" "27240" "8651" "9655" "6688" "5552"
## [177] "7903" "23166" "6772" "6775" "6890" "6891" "6892" "7040"
## [185] "7042" "7070" "7076" "7096" "7097" "7098" "10333" "7124"
## [193] "7163" "7186" "50852" "7321" "7334" "7453" "7454" "7535"
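Note that the genes in this output are Entrez IDs, whereas the .symbols.gmt file read earlier contains gene symbols. The following sketch shows one way to translate a list of gene sets with a named lookup vector. The lookup table here is hand-made from the first few genes of the two HALLMARK_ADIPOGENESIS outputs above (assuming both list the genes in the same order), and map_ids() is a hypothetical helper. In practice the mapping would come from an annotation resource such as the org.Hs.eg.db package.

```r
# Hand-made lookup table (Entrez ID -> gene symbol), illustrative only
lookup = c("19" = "ABCA1", "11194" = "ABCB8", "10449" = "ACAA2", "33" = "ACADL")

# Toy gene sets in Entrez IDs; "99999999" is a made-up ID without a mapping
lt_demo = list(
    set_a = c("19", "11194", "99999999"),
    set_b = c("10449", "33")
)

# Hypothetical helper: translate every gene set and drop unmapped IDs
map_ids = function(gene_sets, lookup) {
    lapply(gene_sets, function(ids) {
        mapped = unname(lookup[ids])
        mapped[!is.na(mapped)]
    })
}

lt_symbols = map_ids(lt_demo, lookup)
lt_symbols
```

Dropping unmapped IDs silently is a design choice of this sketch. Depending on the analysis, it may be preferable to keep them or to report how many were lost.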
Other packages
The msigdbr package on CRAN (https://cran.r-project.org/web/packages/msigdbr/index.html) also provides MSigDB gene sets.
In the rGREAT package, there is also a
read_gmt() function which reads a .gmt
file and supports gene ID conversion.
Practice
Practice 1
The gene set resource on https://maayanlab.cloud/Enrichr/#libraries is very
useful. You may want to use it some day in the future. Take one gene set
collection (e.g. the COVID-19 related gene sets), download the
corresponding gene set file (in .gmt format) and try to
read it into R as a list or a two-column data frame.
download.file("https://maayanlab.cloud/Enrichr/geneSetLibrary?mode=text&libraryName=COVID-19_Related_Gene_Sets", destfile = "covid-19.gmt")
ln = readLines("covid-19.gmt")
ln = strsplit(ln, "\t")
gs = lapply(ln, function(x) x[-(1:2)])
names(gs) = sapply(ln, function(x) x[1])
gs[1:2]
## $`COVID19-E protein host PPI from Krogan`
## [1] "BRD4" "BRD2" "SLC44A2" "ZC3H18" "AP3B1" "CWC27"
##
## $`COVID19-M protein host PPI from Krogan`
## [1] "AAR2" "AASS" "SLC30A7" "SLC30A9" "INTS4" "SAAL1"
## [7] "ANO6" "ATP1B1" "YIF1A" "REEP6" "GGCX" "REEP5"
## [13] "COQ8B" "TARS2" "FAM8A1" "ATP6V1A" "RTN4" "TUBGCP2"
## [19] "TUBGCP3" "AKAP8L" "FASTKD5" "ETFA" "BZW2" "PSMD8"
## [25] "ACADM" "PITRM1" "STOM" "PMPCB" "PMPCA" "SLC25A21"
Convert gs to a data frame
## geneset
## COVID19-E protein host PPI from Krogan1 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan2 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan3 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan4 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan5 COVID19-E protein host PPI from Krogan
## COVID19-E protein host PPI from Krogan6 COVID19-E protein host PPI from Krogan
## gene
## COVID19-E protein host PPI from Krogan1 BRD4
## COVID19-E protein host PPI from Krogan2 BRD2
## COVID19-E protein host PPI from Krogan3 SLC44A2
## COVID19-E protein host PPI from Krogan4 ZC3H18
## COVID19-E protein host PPI from Krogan5 AP3B1
## COVID19-E protein host PPI from Krogan6 CWC27
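One way to obtain such a two-column data frame in base R is to repeat each gene set name once per gene and pair it with the flattened gene vector. This is a sketch on a shortened toy list (gs_demo and df_demo are placeholder names), not necessarily the exact code that produced the output above.

```r
# Shortened toy version of the list of gene sets
gs_demo = list(
    "COVID19-E protein host PPI from Krogan" = c("BRD4", "BRD2", "SLC44A2"),
    "COVID19-M protein host PPI from Krogan" = c("AAR2", "AASS")
)

df_demo = data.frame(
    # repeat each gene set name as many times as the set has genes
    geneset = rep(names(gs_demo), times = lengths(gs_demo)),
    # flatten the list into a single character vector
    gene = unlist(gs_demo)
)
head(df_demo)
```

unlist() names its elements by the set name followed by an index, which is presumably where row names such as "COVID19-E protein host PPI from Krogan1" come from. Passing row.names = NULL to data.frame() gives plain integer row names instead.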
Or use the list_to_data_frame() function in
GSEAtopics:
df = list_to_data_frame(gs)
head(df)
## gene_set gene
## 1 COVID19-E protein host PPI from Krogan BRD4
## 2 COVID19-E protein host PPI from Krogan BRD2
## 3 COVID19-E protein host PPI from Krogan SLC44A2
## 4 COVID19-E protein host PPI from Krogan ZC3H18
## 5 COVID19-E protein host PPI from Krogan AP3B1
## 6 COVID19-E protein host PPI from Krogan CWC27
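The reverse direction, from a two-column data frame back to a list, can be sketched with base R's split(). The toy data frame below uses placeholder set names. The grouping column is turned into a factor with explicit levels so that the gene sets keep their original order instead of being sorted alphabetically.

```r
# Toy two-column data frame with the same layout as the output above
df_demo = data.frame(
    gene_set = c("set_b", "set_b", "set_a"),
    gene = c("BRD4", "BRD2", "AAR2"),
    stringsAsFactors = FALSE
)

# Keep the gene sets in order of first appearance
grp = factor(df_demo$gene_set, levels = unique(df_demo$gene_set))

# split() groups the gene column by gene set and returns a named list
gs_back = split(df_demo$gene, grp)
gs_back
```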