One of the most important tasks for omics studies is to find associations. We will develope a complex visualization for studying associations between gene expression, methylation and various histone modifications.
The extension package EnrichedHeatmap of ComplexHeatmap can concentrate certain type of genomic signals to a set of genomic features, represented as a “normalized matrix”, and eventually visualize it as a customized heatmap.
As an example, let’s say TSS is the genomic feature or genomic target, and we want to associated CpG Islands to every TSS to study the spatial distribution of CGIs aroung TSS. We can extend TSS by 5kb both upstream and downstream, split the 2kb regions by 50bp windows. This results in 200 windowns per TSS. Then for each window, we can either check whether the window overlaps to CGI which assigns the window a vaue of 1 or 0, or we can do it more quatitatively, i.e. to calculate the fraction of this window that overlaps to CGI, then the value is continuous between 0 and 1.
If there are values associated with the genomic signals, we can apply certain method to get the average signals for each window.
For the visualization of this special type of matrix, the column order is fixed, but you can control the row orderings which are the orders of genes in this example.
Let’s do a simple example:
library(EnrichedHeatmap)
load(system.file("extdata", "chr21_test_data.RData", package = "EnrichedHeatmap"))
tss = promoters(genes, upstream = 0, downstream = 1)
mat1_trim = normalizeToMatrix(H3K4me3, tss, value_column = "coverage",
extend = 5000, mean_mode = "w0", w = 50, keep = c(0, 0.99))
EnrichedHeatmap(mat1_trim, col = c("white", "red"), name = "H3K4me3")
We will use the Roadmap Epigenomics dataset. All samples can be classified into two groups based on the gene expression. We will look at the epigenomics difference between the two groups and the association between different data types.
In this document, we look at the patterns around TSS.
As in each group, there are multiple samples. For each sample, e.g. with methylation data, there is a “normalized matrix” that is normalized to TSS. The whole data in a certein group, for a type of epigenomic datastype, can be thought of as a three-dimension dataset, where the first dimension are TSS, the second dimension are the locations around TSS, and the third dimention are samples. For each epigenomic datatype, there are the following three objects.
We have precomputed such datasets:
load("datasets/roadmap_normalized_matrices.RData")
The following objects will be used.
expr/SAMPLE: gene expression as well as
the sample meta data.mat_cgi: normalized matrix for CGIs on TSS.meth_mat_corr/meth_mat_mean/meth_mat_diff:
normalized matrix for methylationhist_mat_corr_list/hist_mat_mean_list/hist_mat_diff_list:
normalized matrix for histone modificationsAll matrices (for both normal matrix and normalized matrices) have the same row orders, which are genes show significant difference between the two groups from the expression data.
dim(expr)
## [1] 1832 27
Let’s first try the expression matrix, using the scaled values or unscaled values.
library(ComplexHeatmap)
Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]))
expr_scaled = t(scale(t(expr)))
Heatmap(expr_scaled, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]))
The only use of the scaled values is they can show the two group difference very clearly. However, in this example, the unscaled values also clearly show the two-group difference. Additionally, the unscaled values show the absolute expression levels, which provides more information than the scaled values. Thus, in this example, we use the unscaled values for the gene expression heatmap.
For all other matrix objects, they are normalized matrices. We can
plot mat_cgi for example.
library(EnrichedHeatmap)
EnrichedHeatmap(mat_cgi)
To associated these various types of data, we simply concatenate all
the corresponding heatmaps. There are five histone modifications in
hist_mat_*_list. Here we take H3K4me3 as an
example.
Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1])) +
EnrichedHeatmap(mat_cgi) +
EnrichedHeatmap(meth_mat_corr) +
EnrichedHeatmap(meth_mat_mean) +
EnrichedHeatmap(meth_mat_diff) +
EnrichedHeatmap(hist_mat_corr_list$H3K4me3) +
EnrichedHeatmap(hist_mat_mean_list$H3K4me3) +
EnrichedHeatmap(hist_mat_diff_list$H3K4me3)
We do some adjustments and it already looks nice.
Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]),
show_row_dend = FALSE, show_column_dend = FALSE, show_column_names = FALSE,
width = unit(4, "cm")) +
EnrichedHeatmap(mat_cgi) +
EnrichedHeatmap(meth_mat_corr) +
EnrichedHeatmap(meth_mat_mean) +
EnrichedHeatmap(meth_mat_diff) +
EnrichedHeatmap(hist_mat_corr_list$H3K4me3) +
EnrichedHeatmap(hist_mat_mean_list$H3K4me3) +
EnrichedHeatmap(hist_mat_diff_list$H3K4me3)
One important thing for a complex visualization is the color. A careful selected set of colors will greatly improve the visualization.
We use the following color schemes:
library(circlize)
cgi_col = c("white", "darkorange")
cor_col_fun = colorRamp2(c(-1, 0, 1), c("darkgreen", "white", "red"))
meth_mean_col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))
generate_diff_color_fun = function(x) {
q = quantile(x, c(0.05, 0.95))
max_q = max(abs(q))
colorRamp2(c(-max_q, 0, max_q), c("#3794bf", "#FFFFFF", "#df8640"))
}
generate_mean_color_fun = function(x) {
colorRamp2(c(0, quantile(x, 0.95)), c("white", "purple"))
}
meth_diff_col_fun = generate_diff_color_fun(meth_mat_diff)
hist_diff_col_fun = generate_diff_color_fun(hist_mat_diff_list$H3K4me3)
hist_mean_col_fun = generate_mean_color_fun(hist_mat_mean_list$H3K4me3)
We also add titles to each heatmap:
ht_list = Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]),
show_row_dend = FALSE, show_column_dend = FALSE, show_column_names = FALSE,
column_title = "scaled expr",
width = unit(4, "cm")) +
EnrichedHeatmap(mat_cgi, col = cgi_col,
column_title = "CGI") +
EnrichedHeatmap(meth_mat_corr, col = cor_col_fun,
column_title = "meth corr") +
EnrichedHeatmap(meth_mat_mean, col = meth_mean_col_fun,
column_title = "meth mean") +
EnrichedHeatmap(meth_mat_diff, col = meth_diff_col_fun,
column_title = "meth diff") +
EnrichedHeatmap(hist_mat_corr_list$H3K4me3, col = cor_col_fun,
column_title = "H3K4me3 corr") +
EnrichedHeatmap(hist_mat_mean_list$H3K4me3, col = hist_mean_col_fun,
column_title = "H3K4me3 mean") +
EnrichedHeatmap(hist_mat_diff_list$H3K4me3, col = hist_diff_col_fun,
column_title = "H3K4me3 diff")
ht_list
By observing the heatmaps, we can split all genes by the expression difference and the methylation level (i.e. high/low methylation).
expr_mean = rowMeans(expr[, SAMPLE$subgroup == "subgroup1"]) -
rowMeans(expr[, SAMPLE$subgroup == "subgroup2"])
expr_split = ifelse(expr_mean > 0, "high", "low")
expr_split = factor(expr_split, levels = c("high", "low"))
set.seed(123)
upstream_index = length(attr(meth_mat_mean, "upstream_index"))
meth_split = kmeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))],
centers = 2)$cluster
x = tapply(rowMeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))]),
meth_split, mean)
od = structure(order(x), names = names(x))
meth_split = paste0("cluster", od[as.character(meth_split)])
combined_split = paste(meth_split, expr_split, sep = "|")
draw(ht_list, row_split = combined_split)
By default, the row orders are determined by the first matrix, which is not useful for this example. We can try to cluter other heatmaps.
In general, clustering the normalized matrix gives better visualization on the distribution patterns of the signals around the genomic targets.
draw(ht_list, row_split = combined_split, main_heatmap = 2, cluster_rows = TRUE)
Note: when setting both
main_heatmap = 2, cluster_rows = TRUE in
draw(), you need to update ComplexHeatmap to 2.24.1 or
2.25.2.
Then we can take it as the base visualization and add more customizations. The next figure is the final version of this roadmap dataset visualization.