I recently came across a paper entitled “The proteogenomic subtypes of acute myeloid leukemia” published on Cancer Cell, which performed subtype classifications on acute myeloid leukemia (AML) with the proteomics data. In the paper, the classification was performed by hierarchical clustering on the distance matrix of the proteomics dataset.
The motivations for me to reanalyze this dataset are the following two:
- I have never analyzed proteomics data, thus I want to see how the data looks like;
- I have never applied cola package on proteomics datasets. I want to see its performance and whether it can generate biological meaningful results.
First the data (the Excel sheet, Data S1 Discovery Cohort.xlsx
) is from the Supplementary Data S1 of the paper.
I first read the clinical annotation data. There are quite a lot of columns and I don’t really need all of them.
library(readxl)
anno = read_xlsx("~/Downloads/Data S1 Discovery Cohort.xlsx", sheet = 2)
anno = as.data.frame(anno)
rownames(anno) = anno[, 1] # the first column is sample id
anno = anno[, -1]
head(anno)
## Sex Age at Diagnosis [Years] Peripheral Blasts [%] Bone marrow Blasts [%]
## F1 f 61.01096 1 40
## F2 m 68.27671 56 54
## F4 f 59.23836 3 50
## F5 m 65.95890 0 30
## F6 f 41.03014 37 51
## F7 m 63.93151 25 27
## Precursor Karyotype NPM1 FLT3 FLT3ITD VAF
## F1 sAML 46 XX del (11)(q14q24)[15], 46,XX [5] WT WT NA
## F2 de novo 47-48,XY,+4,+8,+13 cp WT WT NA
## F4 sAML 46, XX WT ITD 4.34
## F5 de novo 46,XY WT WT NA
## F6 de novo 46,XX,-7,+8, +11 WT WT NA
## F7 de novo 46,XY WT WT NA
## ELN2017 FAB Induction cycles [n] Day 15 bone marrow blasts CR1
## F1 Intermediate M2 2 No blast clearance no CR1
## F2 Adverse M5 2 No blast clearance CR1
## F4 Adverse M4 2 Blast clearance CR1
## F5 Adverse M5b 1 Blast clearance CR1
## F6 Adverse M4 2 No blast clearance no CR1
## F7 Intermediate M1 2 No blast clearance CR1
## Allogeneic HSCT Death Event OS [months] Relapse Event Relapse Date
## F1 non Transplant 1 10.316222 0 <NA>
## F2 non Transplant 1 6.176591 1 2013-09-26
## F4 Transplant 0 41.002053 0 <NA>
## F5 non Transplant 1 7.227926 1 2013-08-05
## F6 Transplant 1 55.195072 1 2016-07-20
## F7 non Transplant 0 25.954825 1 2020-07-13
## RFS [months]
## F1 NA
## F2 2.529774
## F4 41.002053
## F5 5.158111
## F6 NA
## F7 24.574949
Next I read the mutation data and covert it into a matrix where rows are genes and columns are samples. I also format the gene symbols that if there are mutiple genes annotated for a single mutation, only the first gene symbol is used.
mut = read_xlsx("~/Downloads/Data S1 Discovery Cohort.xlsx", sheet = 3)
mut = as.data.frame(mut)
mut[, "Gene"] = gsub(",.*$", "", mut[, "Gene"])
library(reshape2)
f = function(x) paste(unique(x), collapse = ";")
mut = acast(mut, Gene ~ Pat_ID, f, value.var = "Type")
mut[1:5, 1:5]
## F1 F10 F101 F102 F103
## ANKRD26 "" "" "" "" ""
## ASXL1 "" "" "" "" ""
## ASXL2 "" "" "" "" ""
## ATM "" "" "SNV" "" ""
## ATRX "SNV" "" "" "" "SNV"
Next I read the proteomics data matrix. Note here I use the matrix where missing values are all imputed. In the proteomics data matrix, rows actually correspond to proteins. For the ease of discussion, I always write “genes” to describe rows of the proteomics data matrix.
mat = read_xlsx("~/Downloads/Data S1 Discovery Cohort.xlsx", sheet = 6)
mat = mat[-1, ]
gene = unname(unlist(mat[, "PG.Genes"]))
mat = mat[, 1:177]
cn = colnames(mat)
mat = as.numeric(as.matrix(mat))
dim(mat) = c(length(gene), 177)
colnames(mat) = cn
l = is.na(gene)
gene = gene[!l]
mat = mat[!l, ]
Similar as the mutation matrix, I also performed cleaning on gene symbols.
gene = gsub("^;", "", gene)
gene = gsub(";.*$", "", gene)
There are some genes wrongly formatted to dates in Excel. Here I convert them back;
grep("^\\d+-", gene, value = TRUE)
## [1] "5-Mar" "15-Sep" "6-Sep" "2-Sep" "7-Sep" "1-Sep" "8-Sep" "5-Sep"
## [9] "11-Sep" "9-Sep"
gene = gsub("^(\\d+)-Mar$", "MAR\\1", gene)
gene = gsub("^(\\d+)-Sep$", "SEP\\1", gene)
Now I can check whether all genes are unique:
any(duplicated(gene))
## [1] FALSE
Since now all genes are unique, I can assign them as row names of the matrix.
rownames(mat) = gene
Next I adjust the order of the annotation table and mutation matrix to make them consistent to the proteomics data matrix.
anno = anno[colnames(mat), ]
mut2 = matrix("", nrow = nrow(mut), ncol = ncol(mat))
rownames(mut2) = rownames(mut)
colnames(mut2) = colnames(mat)
mut2[rownames(mut), colnames(mut)] = mut
dim(anno)
## [1] 177 20
dim(mut2)
## [1] 79 177
dim(mat)
## [1] 5193 177
identical(rownames(anno), colnames(mat))
## [1] TRUE
identical(colnames(mut2), colnames(mat))
## [1] TRUE
In the paper, the subtype classification was based on hierarchical clustering on the distance matrix of samples. Here I perform the subtype classification by consensus partitioning, with the cola package.
In the following code, ATC is used to extract top features and skmeans is used for partitioning.
library(cola)
set.seed(12345)
mat = adjust_matrix(mat) # simply to remove outliers if there is any
res = consensus_partition(mat, top_value_method = "ATC", partition_method = "skmeans", verbose = FALSE)
collect_plots()
generates a lot of plots that help to decide which number of groups gives “the best partition”.
collect_plots(res, verbose = FALSE)
collect_stats()
applies several quantitative measures for deciding the best k.
collect_stats(res)
All the plots show 4 is the optimized number of subgroups.
best_k = 4
The subtype labels are saved in the variable cl
. And the size of each subgroup are:
cl = get_classes(res, k = best_k)[, 1]
table(cl)
## cl
## 1 2 3 4
## 68 49 34 26
Next I make the signature heatmap which contains genes showing significantly difference between subgroups.
The object tb
contains the results of the signature analysis.
tb = get_signatures(res, k = best_k, verbose = FALSE)
head(tb)
## which_row fdr p_value mean_1 mean_2 mean_3 mean_4
## 1 1 1.314847e-13 1.245953e-14 15.74037 16.34356 16.05322 16.32805
## 2 2 1.957486e-04 1.140445e-04 13.48803 13.87960 13.38700 14.05535
## 3 3 9.774668e-13 1.113595e-13 15.42199 16.03708 15.30332 16.49124
## 4 4 8.892207e-08 2.646211e-08 14.33122 15.03357 15.10674 14.29921
## 5 5 1.412425e-11 2.061517e-12 15.60644 16.27084 15.95040 16.32491
## 6 6 3.339018e-02 2.821209e-02 18.36344 19.01163 18.20078 18.71306
## group_diff scaled_mean_1 scaled_mean_2 scaled_mean_3 scaled_mean_4
## 1 0.6031946 -0.6620782 0.6121620 -0.001177959 0.5793944
## 2 0.6683547 -0.2350354 0.2942459 -0.371595263 0.5318077
## 3 1.1879209 -0.3971493 0.3986856 -0.550696614 0.9862904
## 4 0.8075347 -0.4030978 0.4389542 0.526675741 -0.4414770
## 5 0.7184693 -0.6126957 0.5320413 -0.020055290 0.6251964
## 6 0.8108483 -0.1444529 0.3199708 -0.261002551 0.1060503
## group_diff_scaled km
## 1 1.2742402 4
## 2 0.9034030 5
## 3 1.5369870 3
## 4 0.9681528 4
## 5 1.2378921 4
## 6 0.5809734 3
Next I make an integrative visualization that contains subgroup classification, annotations, signatures and mutations. Note in the following plot, columns always correspond to the same samples in the heatmap and in the oncoprint.
m_sig = mat[tb$which_row, ]
m_sig = t(scale(t(m_sig)))
library(ComplexHeatmap)
library(circlize)
Heatmap(m_sig, col = colorRamp2(c(-2, 0, 2), c("green", "white", "red")),
top_annotation = HeatmapAnnotation(df = anno[, c("NPM1", "FLT3", "ELN2017", "FAB")],
col = list(NPM1 = c("WT" = "#EEEEEE", "Mut" = "pink"),
FLT3 = c("WT" = "#EEEEEE", "ITD" = "brown", "TKD" = "green"),
ELN2017 = c("Intermediate" = "yellow", "Favorable" = "blue", "Adverse" = "red"))
),
row_split = tb$km, column_split = cl, show_row_names = FALSE, show_column_names = FALSE,
cluster_row_slices = FALSE, cluster_column_slices = FALSE,
show_row_dend = FALSE, show_column_dend = FALSE
) %v%
oncoPrint(mut2, alter_fun = list(
"background" = function(x, y, w, h)
grid.rect(x, y, w - unit(2, "pt"), h - unit(2, "pt"), gp = gpar(fill = "#CCCCCC", col = NA)),
"Deletion" = function(x, y, w, h)
grid.rect(x, y, w - unit(2, "pt"), h - unit(2, "pt"), gp = gpar(fill = "orange", col = NA)),
"Insertion" = function(x, y, w, h)
grid.rect(x, y, w - unit(2, "pt"), h - unit(2, "pt"), gp = gpar(fill = "red", col = NA)),
"MNV" = function(x, y, w, h)
grid.rect(x, y, w - unit(2, "pt"), h - unit(2, "pt"), gp = gpar(fill = "darkgreen", col = NA)),
"Replacement" = function(x, y, w, h)
grid.rect(x, y, w - unit(2, "pt"), h - unit(2, "pt"), gp = gpar(fill = "pink", col = NA)),
"SNV" = function(x, y, w, h)
grid.rect(x, y, w - unit(2, "pt"), h*0.33, gp = gpar(fill = "blue", col = NA))
), col = c("Deletion" = "orange", "Insertion" = "red", "MNV" = "darkgreen", "Replacement" = "pink", "SNV" = "blue"),
height = unit(24, "cm")
)
I use test_between_factors()
to check the correlation between cola subgroups
and groups in the annotation table anno
. If the annotation values are numeric,
one-way ANOVA is used and if the annotation values are categorical, chi-square test
is used. In the following result, values are p-values from the corresponding tests.
anno2 = anno[, c("Sex",
"Age at Diagnosis [Years]",
"Peripheral Blasts [%]",
"Bone marrow Blasts [%]",
"Precursor",
"NPM1",
"FLT3",
"ELN2017",
"Day 15 bone marrow blasts",
"CR1",
"Allogeneic HSCT")]
test_between_factors(anno2, as.character(cl))
## as.character(cl)
## Sex 0.6658062138
## Age at Diagnosis [Years] 0.8906220951
## Peripheral Blasts [%] 0.0001369821
## Bone marrow Blasts [%] 0.0005006455
## Precursor 0.0454664541
## NPM1 0.5865044172
## FLT3 0.3922287488
## ELN2017 0.1295126637
## Day 15 bone marrow blasts 0.4613286219
## CR1 0.0829606033
## Allogeneic HSCT 0.0014161595
The tests show annotations of “Peripheral Blasts [%]”, “Bone marrow Blasts [%]” and “Allogeneic HSCT” have significant relations to the cola subgroup classification.
Next chunk of code visualizes how the different annotations distribute in the cola subgroups. Boxplots are used when the annotation values are numeric and barplots are used when the annotation values are categorical.
par(mfrow = c(4, 3))
for(i in seq_len(ncol(anno2))) {
if(is.numeric(anno2[[i]])) {
boxplot(anno2[[i]] ~ cl, xlab = "cola group", ylab = colnames(anno2)[i], main = colnames(anno2)[i])
} else {
tbv = table(anno2[[i]], cl)
level = rownames(tbv)
barplot(table(anno2[[i]], cl), xlab = "cola group", ylab = colnames(anno2)[i],
col = 1+ seq_along(level), main = colnames(anno2)[i])
legend("topright", pch = 15, legend = level, col = 1+ seq_along(level))
}
}
In the annotation table, there is also the survival data, thus, I can check the survival curves in the four cola subgroups.
anno$cola = cl # append `cl` to the annotation table
library("survival")
library("survminer")
fit = survfit(Surv(`OS [months]`, `Death Event`) ~ cola, data = anno)
ggsurvplot(fit)
Next I check the pairwise survival difference. Values in the results are p-values. We can see cola group 1 and group 4 show the most significant survival difference.
pairwise_survdiff(Surv(`OS [months]`, `Death Event`) ~ cola, data = anno, p.adjust.method = "none")
##
## Pairwise comparisons using Log-Rank test
##
## data: anno and cola
##
## 1 2 3
## 2 0.087 - -
## 3 0.766 0.283 -
## 4 0.031 0.565 0.122
##
## P value adjustment method: none
Recall the following signature heatmap which I showed before:
get_signatures(res, k = best_k, verbose = FALSE)
In rows, there are five row groups. I apply Gene Ontology enrichment on the genes
in the five row groups separately, with the function functional_enrichment()
.
The enrichment results are visualized by the package simplifyEnrichment.
In the following heatmap, the left heatmap (green-red) illustrates the adjusted p-values (FDRs) of GO terms enriched in the corresponding group of genes. The right heatmap shows the similarity of GO terms. The word clouds illustrates the overrepresented keywords from the GO terms in each GO cluster.
lt = functional_enrichment(res, k = best_k, ont = "BP", verbose = FALSE)
library(simplifyEnrichment)
simplifyGOFromMultipleLists(lt, verbose = FALSE)
## Perform keywords enrichment for 6 GO lists...
I also apply Gene Ontology enrichment with the Cellular Component (CC) terms. Similar code as below:
lt = functional_enrichment(res, k = best_k, ont = "CC", verbose = FALSE)
library(simplifyEnrichment)
simplifyGOFromMultipleLists(lt, verbose = FALSE)
## Perform keywords enrichment for 6 GO lists...
In the original paper, all 177 samples were classified into six subgroups. However, I cannot find the subgroup classification results in the paper or in the supplementary files, thus I cannot compare cola classification to the classification performed in the paper.
Session info
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] simplifyEnrichment_1.5.2 org.Hs.eg.db_3.14.0 AnnotationDbi_1.56.2
## [4] IRanges_2.28.0 S4Vectors_0.32.3 Biobase_2.54.0
## [7] BiocGenerics_0.40.0 survminer_0.4.9 ggpubr_0.4.0
## [10] ggplot2_3.3.5 survival_3.2-13 circlize_0.4.14
## [13] ComplexHeatmap_2.11.1 cola_2.0.1 reshape2_1.4.4
## [16] readxl_1.3.1 knitr_1.37
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 tidyselect_1.1.1 RSQLite_2.2.9
## [4] BiocParallel_1.28.3 scatterpie_0.1.7 munsell_0.5.0
## [7] codetools_0.2-18 withr_2.4.3 colorspace_2.0-2
## [10] GOSemSim_2.20.0 NLP_0.2-1 highr_0.9
## [13] ggsignif_0.6.3 DOSE_3.20.1 labeling_0.4.2
## [16] slam_0.1-50 GenomeInfoDbData_1.2.7 KMsurv_0.1-5
## [19] polyclip_1.10-0 bit64_4.0.5 farver_2.1.0
## [22] downloader_0.4 vctrs_0.3.8 treeio_1.18.1
## [25] generics_0.1.1 xfun_0.29 R6_2.5.1
## [28] markdown_1.1 doParallel_1.0.16 GenomeInfoDb_1.30.0
## [31] clue_0.3-60 graphlayouts_0.8.0 bitops_1.0-7
## [34] cachem_1.0.6 fgsea_1.20.0 gridGraphics_0.5-1
## [37] assertthat_0.2.1 scales_1.1.1 ggraph_2.0.5
## [40] enrichplot_1.14.1 gtable_0.3.0 Cairo_1.5-14
## [43] tidygraph_1.2.0 rlang_0.4.12 genefilter_1.76.0
## [46] eulerr_6.1.1 GlobalOptions_0.1.2 splines_4.1.2
## [49] rstatix_0.7.0 lazyeval_0.2.2 impute_1.68.0
## [52] broom_0.7.11 brew_1.0-6 yaml_2.2.1
## [55] abind_1.4-5 backports_1.4.1 qvalue_2.26.0
## [58] gridtext_0.1.4 clusterProfiler_4.2.2 tools_4.1.2
## [61] bookdown_0.24 ggplotify_0.1.0 ellipsis_0.3.2
## [64] jquerylib_0.1.4 RColorBrewer_1.1-2 skmeans_0.2-14
## [67] Rcpp_1.0.7 plyr_1.8.6 zlibbioc_1.40.0
## [70] purrr_0.3.4 RCurl_1.98-1.5 GetoptLong_1.1.0
## [73] viridis_0.6.2 zoo_1.8-9 ggrepel_0.9.1
## [76] cluster_2.1.2 magrittr_2.0.1 data.table_1.14.2
## [79] magick_2.7.3 blogdown_0.19 DO.db_2.9
## [82] matrixStats_0.61.0 patchwork_1.1.1 evaluate_0.14
## [85] xtable_1.8-4 XML_3.99-0.8 mclust_5.4.9
## [88] gridExtra_2.3 shape_1.4.6 compiler_4.1.2
## [91] tibble_3.1.6 crayon_1.4.2 shadowtext_0.1.1
## [94] proxyC_0.2.4 htmltools_0.5.2 ggfun_0.0.4
## [97] tidyr_1.1.4 aplot_0.1.2 RcppParallel_5.1.5
## [100] DBI_1.1.2 tweenr_1.0.2 MASS_7.3-55
## [103] Matrix_1.4-0 car_3.0-12 parallel_4.1.2
## [106] igraph_1.2.11 pkgconfig_2.0.3 km.ci_0.5-2
## [109] microbenchmark_1.4.9 xml2_1.3.3 foreach_1.5.1
## [112] ggtree_3.2.1 annotate_1.72.0 bslib_0.3.1
## [115] XVector_0.34.0 yulab.utils_0.0.4 stringr_1.4.0
## [118] digest_0.6.29 tm_0.7-8 Biostrings_2.62.0
## [121] rmarkdown_2.11 cellranger_1.1.0 fastmatch_1.1-3
## [124] survMisc_0.5.5 tidytree_0.3.7 rjson_0.2.21
## [127] lifecycle_1.0.1 nlme_3.1-155 jsonlite_1.7.2
## [130] carData_3.0-5 viridisLite_0.4.0 fansi_1.0.2
## [133] pillar_1.6.4 lattice_0.20-45 KEGGREST_1.34.0
## [136] fastmap_1.1.0 httr_1.4.2 GO.db_3.14.0
## [139] glue_1.6.0 png_0.1-7 iterators_1.0.13
## [142] bit_4.0.4 ggforce_0.3.3 stringi_1.7.6
## [145] sass_0.4.0 blob_1.2.2 memoise_2.0.1
## [148] dplyr_1.0.7 irlba_2.3.5 ape_5.6-1