To demonstrate the use of concatenating heatmaps.
There is an RNASeq dataset. Subgroups from unsupervised clustering are not consistent with the clinical classifications. We suspect the samples may be mixed with other tissues. How to validate it?
We check the expression of corresponding genes in public human tissue datasets, e.g. the GTEx datasets. If a sample is contaminated with a certain tissue, then the expression profile should be similar as in the GTEx datasets.
Gene expression in the original dataset is measured by RPKM values. We first perform a log2-transformation.
rpkm = readRDS("datasets/rpkm.rds")
mat = log2(rpkm + 1)
dim(mat)
## [1] 57820 71
This datasets contains both protein-coding genes and non-protein-coding genes. Following the standard procedure, we first perform unsupervised clustering using the cola package.
library(cola)
mat = adjust_matrix(mat)
res = consensus_partition(mat, top_value_method = "ATC", partition_method = "skmeans", mc.cores = 4, verbose = FALSE)
The best number of subgroups is 3:
select_partition_number(res)
collect_plots(res)
## * plotting empirical cumulative distribution curves of the consensus matrix.
## * plotting consensus classes for all k.
##
## All individual plots can be made by following functions:
## - plot_ecdf(object)
## - collect_classes(object)
## - consensus_heatmap(object, k)
## - membership_heatmap(object, k)
## - get_signatures(object, k)
We next extract signature genes from the 3-subgroup classification, i.e. those genes show significant difference.
sig_df = get_signatures(res, k = 3)
## * 71/71 samples (in 3 classes) remain after filtering by silhouette (>= 0.5).
## * cache hash: b5d8a39422d05a6d478d7b234eef009c (seed 888).
## * calculating row difference between subgroups by Ftest.
## * split rows into 3 groups by k-means clustering.
## * 19151 signatures (48.8%) under fdr < 0.05, group_diff > 0.
## - randomly sample 2000 signatures.
## * making heatmaps for signatures.
dim(sig_df)
## [1] 19151 12
We use the top 1000 most differential signature genes.
top_rows = sig_df$which_row[order(sig_df$fdr)[1:1000]]
m_top = mat[top_rows, ]
Gene IDs are Ensemble IDs with versions encoded. Wer remove the Ensembl versions:
rownames(m_top) = gsub("\\.\\d+$", "", rownames(m_top))
Next we process the GTEx datasets. The dataset is available from https://www.gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression.
Here we use the expression on the gene level (the value can be TPM or
RPKM for different versions of the dataset). The file
GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct can
be found from the website.
The complete dataset is very huge, thus we use the
data.table package to read this huge table. Also note
the expression matrix is saved in the .gct format which
looks like:
1000 10
name description sample1 sample2 ...
gene1 desc1 1.4 4.1 ...
gene2 desc2 3.2 1.9 ...
...
where data is available from the second line. The first column are gene IDs and the second column are gene descriptions.
library(data.table)
gtex_all = fread("datasets/GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct", skip = 1, header = TRUE)
gtex_all = as.data.frame(gtex_all)
rownames(gtex_all) = gtex_all[, 1]
gtex_all = gtex_all[, -(1:2)]
gtex_all = as.matrix(gtex_all)
dim(gtex_all)
## [1] 56318 8555
gtex_all[1:4, 1:4]
## GTEX-111CU-1826-SM-5GZYN GTEX-111FC-0226-SM-5N9B8 GTEX-111VG-2326-SM-5N9BK GTEX-111YS-2426-SM-5GZZQ
## ENSG00000223972.4 0.00000 0.000000 0.00000000 0.000000
## ENSG00000227232.4 4.93505 7.774427 5.95150328 5.864877
## ENSG00000243485.2 0.00000 0.000000 0.01695928 0.000000
## ENSG00000237613.2 0.00000 0.000000 0.00000000 0.000000
In the GTEx datasets, columns are samples from different tissues. The
tissue information can be found in the file
GTEx_Data_V6_Annotations_SampleAttributesDS.txt also on the
GTEx website. Just be sure you choose the same version as the expression
data.
anno = fread("datasets/GTEx_Data_V6_Annotations_SampleAttributesDS.txt")
head(anno)
## SAMPID SMATSSCR SMCENTER SMPTHNTS SMRIN SMTS SMTSD
## <char> <int> <char> <char> <num> <char> <char>
## 1: GTEX-1117F-0003-SM-58Q7G NA B1 NA Blood Whole Blood
## 2: GTEX-1117F-0003-SM-5DWSB NA B1 NA Blood Whole Blood
## 3: GTEX-1117F-0226-SM-5GZZ7 0 B1 2 pieces, ~15% vessel stroma, rep delineated 6.8 Adipose Tissue Adipose - Subcutaneous
## 4: GTEX-1117F-0426-SM-5EGHI 0 B1 2 pieces, !5% fibrous connective tissue, delineated (rep) 7.1 Muscle Muscle - Skeletal
## 5: GTEX-1117F-0526-SM-5EGHJ 0 B1 2 pieces, clean, Monckebeg medial sclerosis, rep delineated 8.0 Blood Vessel Artery - Tibial
## 6: GTEX-1117F-0626-SM-5N9CS 1 B1 2 pieces, up to 4mm aderent fat/nerve/vessel, delineated 6.9 Blood Vessel Artery - Coronary
## SMUBRID SMTSISCH SMTSPAX SMTSTPTREF SMNABTCH SMNABTCHT SMNABTCHD SMGEBTCH SMGEBTCHD
## <char> <int> <int> <char> <char> <char> <char> <char> <char>
## 1: 13756 NA NA Actual Death BP-38516 DNA isolation_Whole Blood _QIAGEN Puregene (Manual) 5/2/13 LCSET-4574
## 2: 13756 NA NA Actual Death BP-38516 DNA isolation_Whole Blood _QIAGEN Puregene (Manual) 5/2/13 GTEx_OM25_Dec_01
## 3: 2190 1214 1125 Actual Death BP-43693 RNA Extraction from Paxgene-derived Lysate Plate Based 9/17/13 LCSET-4804 3/5/14
## 4: 11907 1220 1119 Actual Death BP-43495 RNA Extraction from Paxgene-derived Lysate Plate Based 9/12/13 LCSET-4764 2/8/14
## 5: 7610 1221 1120 Actual Death BP-43495 RNA Extraction from Paxgene-derived Lysate Plate Based 9/12/13 LCSET-4764 2/8/14
## 6: 1621 1243 1098 Actual Death BP-43956 RNA Extraction from Paxgene-derived Lysate Plate Based 9/25/13 LCSET-4904 3/22/14
## SMGEBTCHT SMAFRZE SMGTC SME2MPRT SMCHMPRS SMNTRART SMNUMGPS SMMAPRT SMEXNCRT SM550NRM SMGNSDTC SMUNMPRT SM350NRM SMRDLGTH SMMNCPB
## <char> <char> <char> <num> <int> <num> <int> <num> <num> <num> <int> <num> <num> <int> <num>
## 1: USE ME NA NA NA NA NA NA NA NA NA NA NA NA
## 2: USE ME NA NA NA NA NA NA NA NA NA NA NA NA
## 3: TrueSeq.v1 NA NA NA NA NA NA NA NA NA NA NA NA
## 4: TrueSeq.v1 USE ME 0.9077578 60162 0.9822124 953 0.9270139 0.8969035 0.3622489 22148 0.7231962 0.6877921 76 46.48661
## 5: TrueSeq.v1 NA NA NA NA NA NA NA NA NA NA NA NA
## 6: TrueSeq.v1 USE ME 0.9305153 99793 0.9772871 766 0.9369855 0.8351900 0.3121224 25155 0.8186886 0.7671875 76 50.34656
## SME1MMRT SMSFLGTH SMESTLBS SMMPPD SMNTERRT SMRRNANM SMRDTTL SMVQCFL SMMNCV SMTRSCPT SMMPPDPR SMCGLGTH SMGAPPCT SMUNPDRD SMNTRNRT
## <num> <int> <int> <int> <num> <int> <int> <int> <num> <int> <int> <int> <num> <int> <num>
## 1: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4: 0.002466065 191 42552266 56021766 0.01759680 524010 60432502 9592528 0.6146461 132276 26923668 72980 0.04292628 0 0.08530895
## 5: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 6: 0.002827843 196 84748070 65897178 0.02247888 499616 70328908 3629138 0.5960037 144188 31815484 52882 0.02993701 0 0.14209704
## SMMPUNRT SMEXPEFF SMMPPDUN SME2MMRT SME2ANTI SMALTALG SME2SNSE SMMFLGTH SME1ANTI SMSPLTRD SMBSMMRT SME1SNSE SME1PCTS SMRRNART SME1MPRT
## <num> <num> <int> <num> <int> <int> <int> <int> <int> <int> <num> <int> <num> <num> <num>
## 1: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4: 0.6704128 0.8314419 40514728 0.003035173 11741699 2623072 11861992 155 12207544 10849322 0.002744708 12393839 50.37863 0.008670996 0.9462699
## 5: NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 6: 0.7670994 0.7825610 53949267 0.002766269 13758926 2622233 13856138 175 13910044 11122995 0.002797269 14086179 50.31457 0.007103992 0.9434559
## SMNUM5CD SMDPMPRT SME2PCTS
## <int> <num> <num>
## 1: NA NA NA
## 2: NA NA NA
## 3: NA NA NA
## 4: 837 0.2768038 50.25481
## 5: NA NA NA
## 6: 847 0.1813114 50.17601
Sample IDs are in the column "SAMPID", tissues are in
the column "SMTSD". We generate a list where each element
vector contains sample IDs for the current tissue.
tissue = split(anno$SAMPID, anno$SMTSD)
tissue = tissue[names(tissue) != ""]
We create a tissue-level matrix where we take the mean of all samples in the same tissue.
gtex_tissue = do.call(cbind, lapply(tissue, function(sid) {
l = colnames(gtex_all) %in% sid
if(any(l)) {
rowMeans(as.matrix(gtex_all[, l, drop = FALSE]))
} else {
NULL
}
}))
head(gtex_tissue)
## Adipose - Subcutaneous Adipose - Visceral (Omentum) Adrenal Gland Artery - Aorta Artery - Coronary Artery - Tibial Bladder
## ENSG00000223972.4 5.253178e-03 0.0090625334 0.0112713420 0.0064913781 0.0107213848 0.0025270310 0.009174342
## ENSG00000227232.4 7.220298e+00 6.6113298517 5.3892013747 6.2221548275 6.8373841881 6.1572479254 9.124309843
## ENSG00000243485.2 5.987153e-03 0.0143823031 0.0168087167 0.0038272891 0.0063473426 0.0027649712 0.014222227
## ENSG00000237613.2 6.880974e-05 0.0000000000 0.0000000000 0.0002061127 0.0001000160 0.0000000000 0.000000000
## ENSG00000268020.2 0.000000e+00 0.0000000000 0.0005654169 0.0000000000 0.0000000000 0.0000000000 0.000000000
## ENSG00000240361.1 4.251723e-04 0.0003575944 0.0008263901 0.0000000000 0.0001607869 0.0001117398 0.000000000
## Brain - Amygdala Brain - Anterior cingulate cortex (BA24) Brain - Caudate (basal ganglia) Brain - Cerebellar Hemisphere
## ENSG00000223972.4 0.0015381076 0.006466732 0.0049652994 1.020534e-02
## ENSG00000227232.4 4.4771507829 4.780258255 4.7070904960 1.067803e+01
## ENSG00000243485.2 0.0434607493 0.081009161 0.1143135497 6.284581e-02
## ENSG00000237613.2 0.0000000000 0.000152052 0.0001364206 1.592348e-04
## ENSG00000268020.2 0.0000000000 0.000000000 0.0005929666 1.171120e-03
## ENSG00000240361.1 0.0002844613 0.001084933 0.0012884173 2.077301e-03
## Brain - Cerebellum Brain - Cortex Brain - Frontal Cortex (BA9) Brain - Hippocampus Brain - Hypothalamus
## ENSG00000223972.4 9.988845e-03 0.0057787124 0.0052937919 0.0058332837 0.0040452055
## ENSG00000227232.4 1.110469e+01 6.4691433133 5.3707596075 5.2369453844 5.0220341956
## ENSG00000243485.2 6.058131e-02 0.0914839173 0.1111886339 0.0678335284 0.0758117489
## ENSG00000237613.2 4.710533e-04 0.0005013382 0.0000000000 0.0002816565 0.0003935297
## ENSG00000268020.2 3.971910e-04 0.0026061366 0.0005621474 0.0009445946 0.0000000000
## ENSG00000240361.1 1.044311e-03 0.0024923371 0.0015372000 0.0009807468 0.0007002680
## Brain - Nucleus accumbens (basal ganglia) Brain - Putamen (basal ganglia) Brain - Spinal cord (cervical c-1)
## ENSG00000223972.4 0.0024486807 6.627654e-03 0.006138114
## ENSG00000227232.4 4.7181628658 4.753972e+00 5.559372029
## ENSG00000243485.2 0.1221440886 8.159263e-02 0.024226765
## ENSG00000237613.2 0.0009151911 2.136495e-04 0.000000000
## ENSG00000268020.2 0.0001624539 1.231059e-03 0.000000000
## ENSG00000240361.1 0.0008321162 6.202426e-05 0.000000000
## Brain - Substantia nigra Breast - Mammary Tissue Cells - EBV-transformed lymphocytes Cells - Transformed fibroblasts
## ENSG00000223972.4 0.0053347573 0.0057532282 0.0047575278 0.0081929467
## ENSG00000227232.4 4.8443744656 8.7444986395 4.9503412954 2.7582751613
## ENSG00000243485.2 0.0614059333 0.0157085338 0.0100771667 0.0147034637
## ENSG00000237613.2 0.0002439327 0.0000000000 0.0002101376 0.0001305975
## ENSG00000268020.2 0.0000000000 0.0007699958 0.0000000000 0.0000000000
## ENSG00000240361.1 0.0007122993 0.0003011201 0.0007571487 0.0000000000
## Cervix - Ectocervix Cervix - Endocervix Colon - Sigmoid Colon - Transverse Esophagus - Gastroesophageal Junction Esophagus - Mucosa
## ENSG00000223972.4 0.00000000 0.00000000 0.0124047535 0.0091541217 0.0078852115 0.0071462263
## ENSG00000227232.4 8.84794680 11.42745714 8.4947213583 7.6692119521 7.8145282845 5.7180962992
## ENSG00000243485.2 0.00276221 0.01850351 0.0055805320 0.0106361849 0.0029880885 0.0152353365
## ENSG00000237613.2 0.00000000 0.00000000 0.0002243245 0.0000000000 0.0005747543 0.0000000000
## ENSG00000268020.2 0.00000000 0.00000000 0.0000000000 0.0000000000 0.0006709509 0.0001310969
## ENSG00000240361.1 0.00000000 0.00000000 0.0001426429 0.0001207968 0.0002993776 0.0000000000
## Esophagus - Muscularis Fallopian Tube Heart - Atrial Appendage Heart - Left Ventricle Kidney - Cortex Liver Lung
## ENSG00000223972.4 0.0070279223 0.007864416 0.0060780580 0.0060784869 0.0047466215 0.007395833 0.0108328749
## ENSG00000227232.4 7.2928423360 10.684090376 3.7089290557 2.7618992454 8.2604877576 4.143892196 7.4718997844
## ENSG00000243485.2 0.0039556336 0.024118946 0.0087300958 0.0083390744 0.0586427259 0.004982706 0.0129206349
## ENSG00000237613.2 0.0002546990 0.003106603 0.0000000000 0.0001617615 0.0009675223 0.000000000 0.0001277256
## ENSG00000268020.2 0.0004870794 0.000000000 0.0001857831 0.0000000000 0.0009298800 0.000000000 0.0001281817
## ENSG00000240361.1 0.0003724650 0.008822924 0.0000000000 0.0001516797 0.0012037038 0.001079129 0.0001193418
## Minor Salivary Gland Muscle - Skeletal Nerve - Tibial Ovary Pancreas Pituitary Prostate
## ENSG00000223972.4 0.0060952563 5.209350e-03 1.059927e-02 1.163470e-02 0.0046040329 0.0226078829 1.425530e-02
## ENSG00000227232.4 6.8734334143 3.678154e+00 1.107068e+01 1.086559e+01 5.1066366364 9.6139178415 1.404610e+01
## ENSG00000243485.2 0.0329565241 1.239938e-02 1.303253e-02 1.010755e-02 0.0097909416 0.0359244353 1.660130e-02
## ENSG00000237613.2 0.0000000000 4.316206e-05 1.745498e-04 0.000000e+00 0.0000000000 0.0006272839 0.000000e+00
## ENSG00000268020.2 0.0000000000 4.784921e-05 7.058344e-04 6.998187e-04 0.0000000000 0.0001739266 3.956626e-04
## ENSG00000240361.1 0.0003554187 3.733143e-04 7.349525e-04 3.840165e-04 0.0006101509 0.0004967933 1.993045e-04
## Skin - Not Sun Exposed (Suprapubic) Skin - Sun Exposed (Lower leg) Small Intestine - Terminal Ileum Spleen Stomach
## ENSG00000223972.4 5.997308e-03 7.855048e-03 0.0150996786 1.904407e-02 0.0103461162
## ENSG00000227232.4 1.183911e+01 1.201444e+01 9.1280199723 1.103765e+01 6.7696150226
## ENSG00000243485.2 4.636276e-02 7.460037e-02 0.0122550662 2.004678e-02 0.0152012707
## ENSG00000237613.2 1.991219e-04 0.000000e+00 0.0001077178 3.305415e-04 0.0000000000
## ENSG00000268020.2 0.000000e+00 5.816538e-04 0.0011596937 2.781299e-04 0.0000000000
## ENSG00000240361.1 3.416113e-04 3.468491e-04 0.0000000000 1.009679e-03 0.0004508922
## Testis Thyroid Uterus Vagina Whole Blood
## ENSG00000223972.4 0.258113407 1.140511e-02 1.134091e-02 0.0121569559 0.0276289464
## ENSG00000227232.4 9.725626413 1.062283e+01 1.118985e+01 9.0894905354 6.7398894221
## ENSG00000243485.2 0.369860908 2.485452e-02 1.454852e-02 0.0115614171 0.0070784566
## ENSG00000237613.2 0.003918590 2.288932e-04 0.000000e+00 0.0004805979 0.0006237906
## ENSG00000268020.2 0.003313427 7.205905e-04 5.081389e-04 0.0000000000 0.0000000000
## ENSG00000240361.1 0.022096007 6.010640e-04 1.028935e-03 0.0002685573 0.0010037953
We also need to remove the Ensembl versions.
rownames(gtex_tissue) = gsub("\\.\\d+$", "", rownames(gtex_tissue))
Since the GTEx dataset is RPKM, we also perform the log2-transformation.
gtex_tissue = log2(gtex_tissue + 1)
Now we can integrate the GTEx dataset with our dataset. We adjust
m_top and gtex_tissue to let the same rows
correspond to the same genes in the two matrices.
cn = intersect(rownames(m_top), rownames(gtex_tissue))
m_top = m_top[cn, ]
gtex_tissue = gtex_tissue[cn, ]
Data is ready. To link the two sources of datasets, we simply concatenate the two heatmaps.
library(ComplexHeatmap)
Heatmap(m_top) + Heatmap(gtex_tissue, show_row_names = FALSE)
The following figure is the original analysis also integrating TCGA datasets.