vignettes/work_with_big_datasets.Rmd
work_with_big_datasets.Rmd
cola can be idealy applied to datasets with intermediate number of samples (around 500), however, the running time and the memory usage might not be acceptable if the number of samples gets very huge, e.g., more than thousands. Nevertheless, we can propose the following two strategies to partially solve this problem.
predict_classes()
function for this purpose (see the vignette “Predict classes for new samples” for details).Note, since these two strategies are performed by sampling a small subset of cells from the cohort, the cell clusters with small size might not be detectable.
In the following examples, we use pseudo code to demonstrate the ideas. Assuming the full matrix is in an object mat
. We randomly sample 200 samples from it:
Strategy 1. First we can apply cola on mat1
:
rl = run_all_consensus_partition_methods(mat1, ...)
cola_report(rl, ...)
Then find a best combination of top-value method and partitioining method. Assuming they are saved in objects tm
and pm
. Next run consensus_partition()
on mat
only with tm
and pm
:
res = consensus_partition(mat, top_value_method = tm, partition_method = pm, ...)
Strategy 2. Similar as Strategy 1, we get the ConsensusPartition
object from methods tm
and pm
, which was applied on mat1
:
res = rl[tm, pm]
Note the consensus partition object res
is only based on a subset of original samples. With res
, the classes of remaining samples can be predicted:
mat2 = mat[, setdiff(seq_len(ncol(mat)), ind)]
mat2 = t(scale(t(mat2)))
cl = predict_classes(res, mat2)
Please note, by default mat1
is scaled in cola analysis, correspondingly, mat2
should also be row-scaled.
You can also directly send mat
to predict_classes()
function:
cl = predict_classes(res, t(scale(t(mat))))
DownSamplingConsensusPartition
class
cola implements a new DownSamplingConsensusPartition
class for Strategy 2 mentioned above. It runs consensus partitioning only on a subset of samples and the classes of other samples are predicted by predict_classes()
internally. The DownSamplingConsensusPartition
class inherits the ConsensusPartition
class, so the functions that can be applied to the ConsensusPartition
objects can also be applied to the DownSamplingConsensusPartition
objects, just with some tiny difference.
In the following example, we demonstrate the usage of DownSamplingConsensusPartition
class with the Golub dataset. Here the Golub dataset was used to generate another data object golub_cola
and we can extract the matrix and the annotations by get_matrix()
and get_anno()
functions.
To perform down sampling consensus partitioning, use the helper function consensus_partition_by_down_sampling()
. This function basically run consensus_partition()
on a subset of samples and later predict the classes for all samples by predict_classes()
function. Here we set subset
argument to 50, which means to randomly sample 50 samples from the whole dataset.
data(golub_cola)
m = get_matrix(golub_cola)
set.seed(123)
golub_cola_ds = consensus_partition_by_down_sampling(m, subset = 50,
anno = get_anno(golub_cola), anno_col = get_anno_col(golub_cola),
top_value_method = "SD", partition_method = "kmeans")
The object golub_cola_ds
is already shipped in the package. Simply load the data object.
data(golub_cola_ds)
golub_cola_ds
## A 'DownSamplingConsensusPartition' object with k = 2, 3, 4, 5, 6.
## On a matrix with 4116 rows and 50 columns, randomly sampled from 72 columns.
## Top rows (412) are extracted by 'ATC' method.
## Subgroups are detected by 'skmeans' method.
## Performed in total 250 partitions by row resampling.
## Best k for subgroups seems to be 2.
##
## Following methods can be applied to this 'DownSamplingConsensusPartition' object:
## [1] "cola_report" "collect_classes" "collect_plots"
## [4] "collect_stats" "colnames" "compare_partitions"
## [7] "compare_signatures" "consensus_heatmap" "dimension_reduction"
## [10] "functional_enrichment" "get_anno" "get_anno_col"
## [13] "get_classes" "get_consensus" "get_matrix"
## [16] "get_membership" "get_param" "get_signatures"
## [19] "get_stats" "is_best_k" "is_stable_k"
## [22] "membership_heatmap" "ncol" "nrow"
## [25] "plot_ecdf" "predict_classes" "rownames"
## [28] "select_partition_number" "show" "suggest_best_k"
## [31] "test_to_known_factors" "top_rows_heatmap"
The summary of the golub_cola_ds
is very similar as the ConsensusPartition
object, except that it mentions the object is generated from 50 samples randomly sampled from 72 samples.
All the functions that are applied to the ConsensusParition
class can be applied to the DownSamplingConsensusPartition
class, except some tiny differences.
get_classes()
returns the predicted classes for all samples:
class = get_classes(golub_cola_ds, k = 2)
nrow(class)
## [1] 72
class
## class p
## sample_39 2 0.000
## sample_40 2 0.000
## sample_42 1 0.000
## sample_47 1 0.000
## sample_48 1 0.000
## sample_49 2 0.000
## sample_41 1 0.000
## sample_43 1 0.000
## sample_44 1 0.000
## sample_45 1 0.000
## sample_46 1 0.000
## sample_70 1 0.000
## sample_71 1 0.000
## sample_72 1 0.000
## sample_68 1 0.000
## sample_69 1 0.000
## sample_67 1 0.000
## sample_55 2 0.000
## sample_56 2 0.000
## sample_59 1 0.000
## sample_52 2 0.000
## sample_53 2 0.000
## sample_51 2 0.000
## sample_50 2 0.000
## sample_54 1 0.000
## sample_57 2 0.000
## sample_58 2 0.000
## sample_60 2 0.664
## sample_61 2 0.000
## sample_65 2 0.000
## sample_66 1 0.000
## sample_63 2 0.000
## sample_64 2 0.000
## sample_62 2 0.000
## sample_1 2 0.000
## sample_2 1 0.000
## sample_3 2 0.000
## sample_4 2 0.000
## sample_5 1 0.000
## sample_6 2 0.000
## sample_7 2 0.000
## sample_8 2 0.000
## sample_9 1 0.000
## sample_10 2 0.000
## sample_11 1 0.000
## sample_12 2 0.000
## sample_13 1 0.000
## sample_14 1 0.000
## sample_15 1 0.000
## sample_16 1 0.000
## sample_17 1 0.000
## sample_18 2 0.000
## sample_19 1 0.000
## sample_20 1 0.000
## sample_21 1 0.000
## sample_22 2 0.000
## sample_23 2 0.000
## sample_24 1 0.000
## sample_25 2 0.000
## sample_26 1 0.000
## sample_27 2 0.000
## sample_34 2 0.000
## sample_35 2 0.000
## sample_36 2 0.000
## sample_37 2 0.000
## sample_38 2 0.000
## sample_28 2 0.000
## sample_29 1 0.000
## sample_30 2 0.000
## sample_31 2 0.000
## sample_32 2 0.000
## sample_33 2 0.000
There is an additional column named p
which is the p-value for predicting the class labels. For more details of how the p-value is calculated, please refer to the vignette “Predict classes for new samples”.
If the argument k
is not specified or k
is specified as a vector, the class labels for all k
are returned. Now you can set the p_cutoff
argument so that the class label with p-value larger than this is set to NA
.
get_classes(golub_cola_ds, p_cutoff = 0.05)
## k=2 k=3 k=4 k=5 k=6
## sample_39 2 2 2 2 NA
## sample_40 2 3 3 3 NA
## sample_42 1 1 4 4 3
## sample_47 1 1 1 1 1
## sample_48 1 1 1 1 1
## sample_49 2 3 3 NA 2
## sample_41 1 1 1 1 1
## sample_43 1 1 1 1 1
## sample_44 1 1 1 1 1
## sample_45 1 1 1 1 1
## sample_46 1 1 1 NA NA
## sample_70 1 NA NA NA NA
## sample_71 1 NA 4 4 3
## sample_72 1 1 4 4 3
## sample_68 1 1 1 1 1
## sample_69 1 1 1 1 1
## sample_67 1 1 4 4 3
## sample_55 2 3 3 NA NA
## sample_56 2 3 3 NA NA
## sample_59 1 1 NA NA NA
## sample_52 2 2 2 NA NA
## sample_53 2 2 2 NA NA
## sample_51 2 2 2 2 2
## sample_50 2 2 2 2 2
## sample_54 1 1 1 NA NA
## sample_57 2 2 2 2 2
## sample_58 2 2 2 2 2
## sample_60 NA 1 NA NA NA
## sample_61 2 2 2 2 2
## sample_65 2 2 2 2 2
## sample_66 1 1 1 1 1
## sample_63 2 2 2 2 2
## sample_64 2 2 2 2 2
## sample_62 2 2 2 2 2
## sample_1 2 3 3 3 NA
## sample_2 1 1 4 4 3
## sample_3 2 3 3 3 3
## sample_4 2 3 3 3 NA
## sample_5 1 1 1 1 1
## sample_6 2 3 3 3 NA
## sample_7 2 3 3 3 2
## sample_8 2 3 3 NA 2
## sample_9 1 1 1 NA NA
## sample_10 2 3 3 NA 3
## sample_11 1 1 4 4 3
## sample_12 2 2 2 2 2
## sample_13 1 1 1 1 1
## sample_14 1 1 1 1 NA
## sample_15 1 1 1 1 1
## sample_16 1 1 1 1 1
## sample_17 1 1 1 1 NA
## sample_18 2 3 3 NA NA
## sample_19 1 1 1 1 1
## sample_20 1 1 1 1 1
## sample_21 1 1 1 1 1
## sample_22 2 2 NA NA NA
## sample_23 2 3 3 3 3
## sample_24 1 1 1 1 1
## sample_25 2 2 2 NA NA
## sample_26 1 1 NA NA NA
## sample_27 2 3 3 3 2
## sample_34 2 2 2 2 2
## sample_35 2 2 2 2 2
## sample_36 2 2 2 2 2
## sample_37 2 2 2 2 2
## sample_38 2 2 2 2 2
## sample_28 2 2 2 2 NA
## sample_29 1 1 4 4 3
## sample_30 2 2 2 NA NA
## sample_31 2 2 2 2 2
## sample_32 2 2 2 2 2
## sample_33 2 2 2 2 2
There are several functions uses p_cutoff
argument which controls the number of “usable samples” or the samples with reliable classifications. These functions are get_classes()
, test_to_known_factors()
, dimension_reduction()
and get_signatures()
.
For dimension_reduction()
function, the samples with p-value higher than p_cutoff
are marked by crosses. The samples that were not selected in the random sampling were mapped with smaller dots.
dimension_reduction(golub_cola_ds, k = 2)
## use UMAP
For get_signatures()
function, the signatures are only found in the samples with p-values less than p_cutoff
.
get_signatures(golub_cola_ds, k = 2)
## * 72/72 samples (in 2 classes) remain after filtering by p-value (<= 1).
## * cache hash: f69900f48bd1fbe0f5185b1b5ab034a0 (seed 888).
## * calculating row difference between subgroups by Ftest.
## * split rows into 2 groups by k-means clustering.
## * 1802 signatures (43.8%) under fdr < 0.05, group_diff > 0.
## * making heatmaps for signatures.
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## Random number generation:
## RNG: L'Ecuyer-CMRG
## Normal: Inversion
## Sample: Rejection
##
## locale:
## [1] C/UTF-8/C/C/C/C
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] cola_2.9.1 knitr_1.44 markdown_1.10
##
## loaded via a namespace (and not attached):
## [1] blob_1.2.4 bitops_1.0-7 Biostrings_2.68.1 RCurl_1.98-1.12
## [5] fastmap_1.1.1 XML_3.99-0.14 digest_0.6.33 lifecycle_1.0.4
## [9] cluster_2.1.4 Cairo_1.6-2 survival_3.5-8 KEGGREST_1.40.1
## [13] RSQLite_2.3.1 magrittr_2.0.3 genefilter_1.82.1 compiler_4.3.1
## [17] rlang_1.1.2 sass_0.4.8 rngtools_1.5.2 tools_4.3.1
## [21] yaml_2.3.7 askpass_1.2.0 brew_1.0-8 doRNG_1.8.6
## [25] bit_4.0.5 mclust_6.0.0 reticulate_1.34.0 xml2_1.3.6
## [29] eulerr_7.0.0 RColorBrewer_1.1-3 purrr_1.0.2 BiocGenerics_0.48.1
## [33] desc_1.4.2 grid_4.3.1 stats4_4.3.1 xtable_1.8-4
## [37] colorspace_2.1-0 iterators_1.0.14 cli_3.6.2 rmarkdown_2.25
## [41] crayon_1.5.2 ragg_1.2.6 umap_0.2.10.0 RSpectra_0.16-1
## [45] httr_1.4.7 rjson_0.2.21 DBI_1.1.3 cachem_1.0.8
## [49] stringr_1.5.0 splines_4.3.1 zlibbioc_1.46.0 parallel_4.3.1
## [53] AnnotationDbi_1.62.2 impute_1.74.1 XVector_0.40.0 matrixStats_1.2.0
## [57] vctrs_0.6.4 Matrix_1.6-1.1 jsonlite_1.8.8 slam_0.1-50
## [61] IRanges_2.36.0 GetoptLong_1.0.5 S4Vectors_0.40.2 bit64_4.0.5
## [65] irlba_2.3.5.1 clue_0.3-65 magick_2.8.0 systemfonts_1.0.5
## [69] foreach_1.5.2 jquerylib_0.1.4 annotate_1.78.0 glue_1.6.2
## [73] pkgdown_2.0.7 codetools_0.2-19 stringi_1.7.12 shape_1.4.6
## [77] GenomeInfoDb_1.36.4 ComplexHeatmap_2.18.0 htmltools_0.5.7 openssl_2.1.1
## [81] GenomeInfoDbData_1.2.10 circlize_0.4.15 R6_2.5.1 microbenchmark_1.4.10
## [85] textshaping_0.3.7 doParallel_1.0.17 rprojroot_2.0.3 evaluate_0.22
## [89] lattice_0.21-9 Biobase_2.60.0 png_0.1-8 memoise_2.0.1
## [93] bslib_0.6.1 Rcpp_1.0.11 xfun_0.40 fs_1.6.3
## [97] MatrixGenerics_1.12.3 skmeans_0.2-16 GlobalOptions_0.1.2