Hierarchical partition

hierarchical_partition(data,
    top_n = NULL,
    top_value_method = "ATC",
    partition_method = "skmeans",
    combination_method = expand.grid(top_value_method, partition_method),
    anno = NULL, anno_col = NULL,
    mean_silhouette_cutoff = 0.9, min_samples = max(6, round(ncol(data)*0.01)),
    subset = Inf, predict_method = "centroid",
    group_diff = ifelse(scale_rows, 0.5, 0),
    fdr_cutoff = cola_opt$fdr_cutoff,
    min_n_signatures = NULL,
    filter_fun = function(mat) {
        s = rowSds(mat)
        s > quantile(unique(s[s > 1e-10]), 0.05, na.rm = TRUE)
    },
    max_k = 4, scale_rows = TRUE, verbose = TRUE, mc.cores = 1, cores = mc.cores, help = TRUE, ...)

Arguments

data: A numeric matrix. Subgroups are looked for among the columns.
top_n: Number of rows with top values.
top_value_method: A single top-value method or a vector of them. Available methods are in all_top_value_methods.
partition_method: A single partitioning method or a vector of them. Available methods are in all_partition_methods.
combination_method: A list of combinations of top-value methods and partitioning methods. The value can be a two-column data frame where the first column holds the top-value methods and the second column holds the partitioning methods. Alternatively, it can be a vector of combination names in the form "top_value_method:partitioning_method".
anno: A data frame with known annotation of samples. The annotations will be plotted in heatmaps and the correlation to predicted subgroups will be tested.
anno_col: A list of colors (color is defined as a named vector) for the annotations. If anno is a data frame, anno_col should be a named list where names correspond to the column names in anno.
mean_silhouette_cutoff: Cutoff on the mean silhouette score, used to test whether the partition at the current node is stable.
min_samples: Cutoff on the number of samples in a node, used to decide whether to continue looking for subgroups.
group_diff: Passed to get_signatures,ConsensusPartition-method.
fdr_cutoff: Passed to get_signatures,ConsensusPartition-method.
subset: Number of columns to randomly sample.
predict_method: Method for predicting class labels. Possible values are "centroid", "svm" and "randomForest".
min_n_signatures: Minimal number of signatures under the best classification.
filter_fun: A self-defined function which filters the original matrix and returns a submatrix for partitioning.
max_k: Maximal number of partitions to try. The function tries 2:max_k partitions. Note that this is the number of partitions tried at each node of the hierarchical partition. Because further subgroups are found across the whole hierarchy, max_k should not be set to a large value.
scale_rows: Whether to scale rows.
verbose: Whether to print messages.
mc.cores: Number of cores to use. This argument will be removed in future versions.
cores: Number of cores, or a cluster object returned by makeCluster.
help: Whether to show the help message.
...: Passed to consensus_partition.
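
The two forms accepted by combination_method can be illustrated with base R alone. This is a minimal sketch, not a call into the package: the variable names comb_df and comb_names are made up for the illustration, and the method names are the ones that appear on this page.

```r
# Candidate methods (names taken from this page).
top_value_method = c("SD", "ATC")
partition_method = c("kmeans", "skmeans")

# Form 1: a two-column data frame, top-value methods first,
# partitioning methods second. expand.grid() builds every pairing.
comb_df = expand.grid(top_value_method = top_value_method,
                      partition_method = partition_method,
                      stringsAsFactors = FALSE)

# Form 2: a character vector of "top_value_method:partitioning_method" names.
comb_names = paste(comb_df$top_value_method, comb_df$partition_method, sep = ":")

print(comb_df)
print(comb_names)
```

Either object describes the same four combinations; expand.grid() varies its first argument fastest, which determines the order of the names.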

Details

The function looks for subgroups in a hierarchical way.

Nodes in the hierarchy are encoded in a special way. The length of a node name is the depth of the node in the hierarchy, and the name with its last digit removed is the name of the parent node. E.g. for the node 0011, the depth is 4 and the parent node is 001.
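
The encoding can be decoded with plain string operations. The helpers below (node_depth, parent_node, node_ancestors) are hypothetical names written for this illustration and are not functions of the package.

```r
# Hypothetical helpers illustrating the node-name encoding described above.

# Depth of a node = number of characters in its name.
node_depth = function(node_id) nchar(node_id)

# Parent of a node = its name without the last digit.
# The root ("0") has no parent, so NA is returned for it.
parent_node = function(node_id) {
    if (nchar(node_id) <= 1) return(NA_character_)
    substr(node_id, 1, nchar(node_id) - 1)
}

# All nodes on the path from the root down to the node itself.
node_ancestors = function(node_id) {
    vapply(seq_len(nchar(node_id)),
           function(i) substr(node_id, 1, i),
           character(1))
}

node_depth("0011")      # 4
parent_node("0011")     # "001"
node_ancestors("021")   # "0" "02" "021"
```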

Value

A HierarchicalPartition-class object. Type the name of the object in an interactive R session to see which functions can be applied to it.

Author

Zuguang Gu <z.gu@dkfz.de>

Examples

# \dontrun{
set.seed(123)
m = cbind(rbind(matrix(rnorm(20*20, mean = 2, sd = 0.3), nr = 20),
                matrix(rnorm(20*20, mean = 0, sd = 0.3), nr = 20),
                matrix(rnorm(20*20, mean = 0, sd = 0.3), nr = 20)),
          rbind(matrix(rnorm(20*20, mean = 0, sd = 0.3), nr = 20),
                matrix(rnorm(20*20, mean = 1, sd = 0.3), nr = 20),
                matrix(rnorm(20*20, mean = 0, sd = 0.3), nr = 20)),
          rbind(matrix(rnorm(20*20, mean = 0, sd = 0.3), nr = 20),
                matrix(rnorm(20*20, mean = 0, sd = 0.3), nr = 20),
                matrix(rnorm(20*20, mean = 1, sd = 0.3), nr = 20))
         ) + matrix(rnorm(60*60, sd = 0.5), nr = 60)
rh = hierarchical_partition(m, top_value_method = "SD", partition_method = "kmeans")
#> * hierarchical partition on a 60x60 matrix.
#> * running SD:kmeans.
#> * calculate top-values.
#> ================== node 0 ============================
#> * submatrix with 60 columns, node_id: 0.
#> * 3/60 rows are removed for partitioning, due to very small variance.
#> * -------------------------------------------------------
#> * run SD:kmeans on a 57x60 matrix.
#> * calculating SD values.
#> * rows are scaled before sent to partition, method: 'z-score' (x - mean)/sd
#> * get top 6 rows by SD method
#> * wrap results for k = 2
#> * wrap results for k = 3
#> * wrap results for k = 4
#> * adjust class labels between different k.
#> * SD:kmeans used 0.964 secs.
#> * -------------------------------------------------------
#> * select SD:kmeans (2 groups) because this is the only stable partitioning result.
#> * checking number of signatures in the best classification.
#> * best k = 2, partition into 2 subgroups.
#>   ================== node 01 ============================
#>   * submatrix with 20 columns, node_id: 01.
#>   * 3/60 rows are pre-filtered out before partitioning.
#>   * -------------------------------------------------------
#>   * run SD:kmeans on a 57x20 matrix.
#>   * calculating SD values.
#>   * rows are scaled before sent to partition, method: 'z-score' (x - mean)/sd
#>   * get top 6 rows by SD method
#>   * wrap results for k = 2
#>   * wrap results for k = 3
#>   * wrap results for k = 4
#>   * adjust class labels between different k.
#>   * SD:kmeans used 0.885 secs.
#>   * -------------------------------------------------------
#>   * select SD:kmeans (4 groups) as the best partitioning result.
#>   * mean_silhouette score is too small (0.65), stop.
#>   ================== node 02 ============================
#>   * submatrix with 40 columns, node_id: 02.
#>   * 3/60 rows are pre-filtered out before partitioning.
#>   * -------------------------------------------------------
#>   * run SD:kmeans on a 57x40 matrix.
#>   * calculating SD values.
#>   * rows are scaled before sent to partition, method: 'z-score' (x - mean)/sd
#>   * get top 5 rows by SD method
#>   * wrap results for k = 2
#>   * wrap results for k = 3
#>   * wrap results for k = 4
#>   * adjust class labels between different k.
#>   * SD:kmeans used 0.885 secs.
#>   * -------------------------------------------------------
#>   * select SD:kmeans (2 groups) because this is the only stable partitioning result.
#>   * checking number of signatures in the best classification.
#>   * best k = 2, partition into 2 subgroups.
#>     ================== node 021 ============================
#>     * submatrix with 20 columns, node_id: 021.
#>     * 3/60 rows are pre-filtered out before partitioning.
#>     * -------------------------------------------------------
#>     * run SD:kmeans on a 57x20 matrix.
#>     * calculating SD values.
#>     * rows are scaled before sent to partition, method: 'z-score' (x - mean)/sd
#>     * get top 4 rows by SD method
#>     * wrap results for k = 2
#>     * wrap results for k = 3
#>     * wrap results for k = 4
#>     * adjust class labels between different k.
#>     * SD:kmeans used 0.8710001 secs.
#>     * -------------------------------------------------------
#>     * select SD:kmeans (3 groups) as the best partitioning result.
#>     * mean_silhouette score is too small (0.54), stop.
#>     ================== node 022 ============================
#>     * submatrix with 20 columns, node_id: 022.
#>     * 3/60 rows are pre-filtered out before partitioning.
#>     * -------------------------------------------------------
#>     * run SD:kmeans on a 57x20 matrix.
#>     * calculating SD values.
#>     * rows are scaled before sent to partition, method: 'z-score' (x - mean)/sd
#>     * get top 6 rows by SD method
#>     * wrap results for k = 2
#>     * wrap results for k = 3
#>     * wrap results for k = 4
#>     * adjust class labels between different k.
#>     * SD:kmeans used 0.904 secs.
#>     * -------------------------------------------------------
#>     * select SD:kmeans (2 groups) as the best partitioning result.
#>     * mean_silhouette score is too small (0.43), stop.
#> * formatting the results into a HierarchicalPartition object.
#> * totally used 7.833814 secs.
# }
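
The log line "3/60 rows are removed for partitioning" appears consistent with the default filter_fun shown in the usage. The sketch below applies the same rule to an arbitrary 60-row matrix. It uses apply(mat, 1, sd) as a base-R stand-in for rowSds(), and the matrix and seed are made up for the illustration rather than taken from the example above.

```r
set.seed(1)  # arbitrary seed, for this illustration only
mat = matrix(rnorm(60 * 60), nrow = 60)

# Base-R stand-in for rowSds(); the usage above calls rowSds() itself.
row_sd = apply(mat, 1, sd)

# Same rule as the default filter_fun: keep rows whose SD exceeds the
# 5% quantile of the unique, non-negligible SDs.
keep = row_sd > quantile(unique(row_sd[row_sd > 1e-10]), 0.05, na.rm = TRUE)

n_removed = sum(!keep)
cat(n_removed, "/", nrow(mat), " rows removed\n", sep = "")
```

With 60 rows whose standard deviations are all distinct, the default 5% quantile falls between the third and fourth smallest value, so three rows are dropped regardless of the data.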