Complex Epigenomic Associations

Purpose

One of the most important tasks for omics studies is to find associations. We will develope a complex visualization for studying associations between gene expression, methylation and various histone modifications.

Method

The extension package EnrichedHeatmap of ComplexHeatmap can concentrate certain type of genomic signals to a set of genomic features, represented as a “normalized matrix”, and eventually visualize it as a customized heatmap.

As an example, let’s say TSS is the genomic feature or genomic target, and we want to associated CpG Islands to every TSS to study the spatial distribution of CGIs aroung TSS. We can extend TSS by 5kb both upstream and downstream, split the 2kb regions by 50bp windows. This results in 200 windowns per TSS. Then for each window, we can either check whether the window overlaps to CGI which assigns the window a vaue of 1 or 0, or we can do it more quatitatively, i.e. to calculate the fraction of this window that overlaps to CGI, then the value is continuous between 0 and 1.

If there are values associated with the genomic signals, we can apply certain method to get the average signals for each window.

For the visualization of this special type of matrix, the column order is fixed, but you can control the row orderings which are the orders of genes in this example.

Let’s do a simple example:

library(EnrichedHeatmap)
load(system.file("extdata", "chr21_test_data.RData", package = "EnrichedHeatmap"))
tss = promoters(genes, upstream = 0, downstream = 1)
mat1_trim = normalizeToMatrix(H3K4me3, tss, value_column = "coverage", 
    extend = 5000, mean_mode = "w0", w = 50, keep = c(0, 0.99))
EnrichedHeatmap(mat1_trim, col = c("white", "red"), name = "H3K4me3")

Dataset

We will use the Roadmap Epigenomics dataset. All samples can be classified into two groups based on the gene expression. We will look at the epigenomics difference between the two groups and the association between different data types.

In this document, we look at the patterns around TSS.

As in each group, there are multiple samples. For each sample, e.g. with methylation data, there is a “normalized matrix” that is normalized to TSS. The whole data in a certein group, for a type of epigenomic datastype, can be thought of as a three-dimension dataset, where the first dimension are TSS, the second dimension are the locations around TSS, and the third dimention are samples. For each epigenomic datatype, there are the following three objects.

a normalized matrix showing correlations of signals in each window to gene expression. This is done by calculating the correlation of the vector associated to each window and the corresponding gene expression vector of that gene.
Next for the two group, we transform the two three-dimension array(-like) to two “mean” normaliezd matrix, where each “mean” matrix is aggregated on the third dimension. Let’s call them mean1 and mean2.
- A final mean matrix which is the mean of mean1 and mean2, which meausres the absolute levels of signals.
- A difference metrix, which is mean1 - mean2, which measures the difference between the two groups.

We have precomputed such datasets:

load("datasets/roadmap_normalized_matrices.RData")

The following objects will be used.

expr/SAMPLE: gene expression as well as the sample meta data.
mat_cgi: normalized matrix for CGIs on TSS.
meth_mat_corr/meth_mat_mean/meth_mat_diff: normalized matrix for methylation
hist_mat_corr_list/hist_mat_mean_list/hist_mat_diff_list: normalized matrix for histone modifications

All matrices (for both normal matrix and normalized matrices) have the same row orders, which are genes show significant difference between the two groups from the expression data.

dim(expr)

## [1] 1832   27

Analysis

Let’s first try the expression matrix, using the scaled values or unscaled values.

library(ComplexHeatmap)
Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]))

expr_scaled = t(scale(t(expr)))
Heatmap(expr_scaled, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]))

The only use of the scaled values is they can show the two group difference very clearly. However, in this example, the unscaled values also clearly show the two-group difference. Additionally, the unscaled values show the absolute expression levels, which provides more information than the scaled values. Thus, in this example, we use the unscaled values for the gene expression heatmap.

For all other matrix objects, they are normalized matrices. We can plot mat_cgi for example.

library(EnrichedHeatmap)
EnrichedHeatmap(mat_cgi)

To associated these various types of data, we simply concatenate all the corresponding heatmaps. There are five histone modifications in hist_mat_*_list. Here we take H3K4me3 as an example.

Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1])) +
EnrichedHeatmap(mat_cgi) +
EnrichedHeatmap(meth_mat_corr) +
EnrichedHeatmap(meth_mat_mean) +
EnrichedHeatmap(meth_mat_diff) +
EnrichedHeatmap(hist_mat_corr_list$H3K4me3) +
EnrichedHeatmap(hist_mat_mean_list$H3K4me3) +
EnrichedHeatmap(hist_mat_diff_list$H3K4me3)

We do some adjustments and it already looks nice.

Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]),
    show_row_dend = FALSE, show_column_dend = FALSE, show_column_names = FALSE,
    width = unit(4, "cm")) +
EnrichedHeatmap(mat_cgi) +
EnrichedHeatmap(meth_mat_corr) +
EnrichedHeatmap(meth_mat_mean) +
EnrichedHeatmap(meth_mat_diff) +
EnrichedHeatmap(hist_mat_corr_list$H3K4me3) +
EnrichedHeatmap(hist_mat_mean_list$H3K4me3) +
EnrichedHeatmap(hist_mat_diff_list$H3K4me3)

One important thing for a complex visualization is the color. A careful selected set of colors will greatly improve the visualization.

We use the following color schemes:

CGI: white-orange
methylation: blue-white-red
histome modification: white-purple
correlation: green-white-red
difference: blue-white-brown

library(circlize)
cgi_col = c("white", "darkorange")
cor_col_fun = colorRamp2(c(-1, 0, 1), c("darkgreen", "white", "red"))
meth_mean_col_fun = colorRamp2(c(0, 0.5, 1), c("blue", "white", "red"))

generate_diff_color_fun = function(x) {
    q = quantile(x, c(0.05, 0.95))
    max_q = max(abs(q))
    colorRamp2(c(-max_q, 0, max_q), c("#3794bf", "#FFFFFF", "#df8640"))
}

generate_mean_color_fun = function(x) {
    colorRamp2(c(0, quantile(x, 0.95)), c("white", "purple"))
}

meth_diff_col_fun = generate_diff_color_fun(meth_mat_diff)

hist_diff_col_fun = generate_diff_color_fun(hist_mat_diff_list$H3K4me3)
hist_mean_col_fun = generate_mean_color_fun(hist_mat_mean_list$H3K4me3)

We also add titles to each heatmap:

ht_list = Heatmap(expr, top_annotation = HeatmapAnnotation(df = SAMPLE[, -1]),
        show_row_dend = FALSE, show_column_dend = FALSE, show_column_names = FALSE,
        column_title = "scaled expr",
        width = unit(4, "cm")) +
    EnrichedHeatmap(mat_cgi, col = cgi_col, 
        column_title = "CGI") +
    EnrichedHeatmap(meth_mat_corr, col = cor_col_fun,
        column_title = "meth corr") +
    EnrichedHeatmap(meth_mat_mean, col = meth_mean_col_fun,
        column_title = "meth mean") +
    EnrichedHeatmap(meth_mat_diff, col = meth_diff_col_fun,
        column_title = "meth diff") +
    EnrichedHeatmap(hist_mat_corr_list$H3K4me3, col = cor_col_fun,
        column_title = "H3K4me3 corr") +
    EnrichedHeatmap(hist_mat_mean_list$H3K4me3, col = hist_mean_col_fun,
        column_title = "H3K4me3 mean") +
    EnrichedHeatmap(hist_mat_diff_list$H3K4me3, col = hist_diff_col_fun,
        column_title = "H3K4me3 diff")

ht_list

By observing the heatmaps, we can split all genes by the expression difference and the methylation level (i.e. high/low methylation).

expr_mean = rowMeans(expr[, SAMPLE$subgroup == "subgroup1"]) - 
    rowMeans(expr[, SAMPLE$subgroup == "subgroup2"])
expr_split = ifelse(expr_mean > 0, "high", "low")
expr_split = factor(expr_split, levels = c("high", "low"))

set.seed(123)
upstream_index = length(attr(meth_mat_mean, "upstream_index"))
meth_split = kmeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))], 
    centers = 2)$cluster
x = tapply(rowMeans(meth_mat_mean[, seq(round(upstream_index*0.8), round(upstream_index*1.4))]), 
    meth_split, mean)
od = structure(order(x), names = names(x))
meth_split = paste0("cluster", od[as.character(meth_split)])
combined_split = paste(meth_split, expr_split, sep = "|")

draw(ht_list, row_split = combined_split)

By default, the row orders are determined by the first matrix, which is not useful for this example. We can try to cluter other heatmaps.

In general, clustering the normalized matrix gives better visualization on the distribution patterns of the signals around the genomic targets.

draw(ht_list, row_split = combined_split, main_heatmap = 2, cluster_rows = TRUE)

Note: when setting both main_heatmap = 2, cluster_rows = TRUE in draw(), you need to update ComplexHeatmap to 2.24.1 or 2.25.2.

Then we can take it as the base visualization and add more customizations. The next figure is the final version of this roadmap dataset visualization.

Complex Epigenomic Associations

Zuguang Gu z.gu@dkfz.de

2025-06-25

Purpose

Method

Dataset

Analysis

Reference