6 min read

Helper functions for genome-scale heatmap

In my previous blog post, I demonstrated how to make a genome-scale heatmap with multiple other tracks. The key thing there is to split the genome into bins and to normalize various genomic signals to average them into every bin of the genome. From ComplexHeatmap version 2.7.1.1003, I added some helper functions which simplify the binning of the genome and the overlapping between genomic bins and genomic signals with two new functions bin_genome() and normalize_genomic_signals_to_bins(). Here I will introduce the usage of these two functions.

The usage of bin_genome() is very straightforward. You just select the genome and bin size, and optional, a subset of chromosomes.

library(ComplexHeatmap)
chr_window = bin_genome("hg19")
chr_window
## GRanges object with 1990 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##      [1]     chr1         1-1547839      *
##      [2]     chr1   1547840-3095678      *
##      [3]     chr1   3095679-4643517      *
##      [4]     chr1   4643518-6191356      *
##      [5]     chr1   6191357-7739195      *
##      ...      ...               ...    ...
##   [1986]     chrY 51078688-52626526      *
##   [1987]     chrY 52626527-54174365      *
##   [1988]     chrY 54174366-55722204      *
##   [1989]     chrY 55722205-57270043      *
##   [1990]     chrY 57270044-58817882      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

The returned value chr_window is a GRanges object which contains the genomic bins.

There are several parameters you can set in bin_genome().

  • species: Abbreviation of the genome, e.g., “hg19” or “mm10”.
  • bins: Number of bins to split the genome. The final number will be approximately equal to this number. bins is for calculating a proper bin_size value.
  • bin_size: The bin size. If it is set, bins is ignored.
  • ...: All the other arguments are passed to circlize::read.chromInfo(), e.g. you can set the genome by a data frame if it is not available on UCSC database, or you can set a subset of chromosomes by chromosome.index argument.

The second function normalize_genomic_signals_to_bins(), as the name tells, is for normalizing the genomic signals to the genomic bins. The usage is:

x = normalize_genomic_signals_to_bins(gr, value, value_column, method, empty_value)

The arguments are:

It supports following values:

  • When neither value nor value_column is set, it simply overlaps gr to the genomic bins and returns a one-column logical matrix which represents whether the current genomic bin overlaps to any signal.
  • When the signals are numeric, value can be a numeric vector or a matrix, or value_column can contain multiple columns. The function returns a numeric matrix where the values are properly averaged depending on what method was used.
  • When the signals are character, value can only be a vector or value_column can only contain one single column. The function returns a one-column character matrix.

You don’t need to provide the the genomic bins for normalize_genomic_signals_to_bins() because when bin_genome() is executed, the genomic bins are saved internally, so every use of normalize_genomic_signals_to_bins() can directly use the saved genomic bins and it ensures multiple use of normalize_genomic_signals_to_bins() always generate the matrices with the same rows.

That is basically everything, normalize_genomic_signals_to_bins() generates a matrix and you can use it directly with function Heatmap() or other heatmap-related fucntions.

The following is the example I generated in my previous blog post, but here I rewrite it with the use of bin_genome() and normalize_genomic_signals_to_bins().

library(ComplexHeatmap)
library(circlize)
library(GenomicRanges)

chr_window = bin_genome("hg19")

#### the first is a numeric matrix #######
bed1 = generateRandomBed(nr = 1000, nc = 10)
gr1 = GRanges(seqnames = bed1[, 1], ranges = IRanges(bed1[, 2], bed1[, 3]))

num_mat = normalize_genomic_signals_to_bins(gr1, bed1[, -(1:3)])

#### the second is a character matrix ######
bed_list = lapply(1:10, function(i) {
    generateRandomBed(nr = 1000, nc = 1, 
        fun = function(n) sample(c("gain", "loss"), n, replace = TRUE))
})
char_mat = NULL
for(i in 1:10) {
    bed = bed_list[[i]]
    bed = bed[sample(nrow(bed), 20), , drop = FALSE]
    gr_cnv = GRanges(seqnames = bed[, 1], ranges = IRanges(bed[, 2], bed[, 3]))

    char_mat = cbind(char_mat, normalize_genomic_signals_to_bins(gr_cnv, bed[, 4]))
}

#### two numeric columns ##########
bed2 = generateRandomBed(nr = 100, nc = 2)
gr2 = GRanges(seqnames = bed2[, 1], ranges = IRanges(bed2[, 2], bed2[, 3]))

v = normalize_genomic_signals_to_bins(gr2, bed2[, 4:5])

##### a list of genes need to be marked
bed3 = generateRandomBed(nr = 40, nc = 0)
gr3 = GRanges(seqnames = bed3[, 1], ranges = IRanges(bed3[, 2], bed3[, 2]))
gr3$gene = paste0("gene_", 1:length(gr3))

mtch = as.matrix(findOverlaps(chr_window, gr3))
at = mtch[, 1]
labels = mcols(gr3)[mtch[, 2], 1]

##### order of the chromosomes ########
chr = as.vector(seqnames(chr_window))
chr_level = paste0("chr", c(1:22, "X", "Y"))
chr = factor(chr, levels = chr_level)

#### make the heatmap #######
subgroup = rep(c("A", "B"), each = 5)

ht_opt$TITLE_PADDING = unit(c(4, 4), "points")
ht_list = Heatmap(num_mat, name = "mat", col = colorRamp2(c(-1, 0, 1), c("green", "white", "red")),
        row_split = chr, cluster_rows = FALSE, show_column_dend = FALSE,
        column_split = subgroup, cluster_column_slices = FALSE,
        column_title = "numeric matrix",
        top_annotation = HeatmapAnnotation(subgroup = subgroup, annotation_name_side = "left"),
        row_title_rot = 0, row_title_gp = gpar(fontsize = 10), border = TRUE,
        row_gap = unit(0, "points")) +
    Heatmap(char_mat, name = "CNV", col = c("gain" = "red", "loss" = "blue"),
        border = TRUE, column_title = "character matrix") +
    rowAnnotation(label = anno_mark(at = at, labels = labels)) +
    rowAnnotation(pt = anno_points(v, gp = gpar(col = 4:5), pch = c(1, 16)), 
        width = unit(2, "cm")) +
    rowAnnotation(bar = anno_barplot(v[, 1], gp = gpar(col = ifelse(v[ ,1] > 0, 2, 3))), 
        width = unit(2, "cm"))
draw(ht_list, merge_legend = TRUE)

Session Info:

sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
##  [1] parallel  stats4    grid      stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
## [1] GenomicRanges_1.40.0      GenomeInfoDb_1.24.2      
## [3] IRanges_2.22.2            S4Vectors_0.26.1         
## [5] BiocGenerics_0.34.0       circlize_0.4.12          
## [7] ComplexHeatmap_2.7.1.1003 GetoptLong_1.0.4         
## [9] knitr_1.30               
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5             EnrichedHeatmap_1.21.1 compiler_4.0.2        
##  [4] RColorBrewer_1.1-2     XVector_0.28.0         bitops_1.0-6          
##  [7] tools_4.0.2            zlibbioc_1.34.0        digest_0.6.27         
## [10] lattice_0.20-41        evaluate_0.14          clue_0.3-57           
## [13] png_0.1-7              rlang_0.4.8            magick_2.5.2          
## [16] yaml_2.2.1             blogdown_0.17          xfun_0.19             
## [19] GenomeInfoDbData_1.2.3 stringr_1.4.0          cluster_2.1.0         
## [22] GlobalOptions_0.1.2    locfit_1.5-9.4         rmarkdown_2.5         
## [25] bookdown_0.21          magrittr_2.0.1         matrixStats_0.57.0    
## [28] htmltools_0.5.0        shape_1.4.5            colorspace_2.0-0      
## [31] stringi_1.5.3          RCurl_1.98-1.2         crayon_1.3.4          
## [34] rjson_0.2.20           Cairo_1.5-12.2