Tissue Contamination

Analysis

Gene expression in the original dataset is measured by RPKM values. We first perform a log2-transformation.

rpkm = readRDS("datasets/rpkm.rds")
mat = log2(rpkm + 1)
dim(mat)

## [1] 57820    71

This datasets contains both protein-coding genes and non-protein-coding genes. Following the standard procedure, we first perform unsupervised clustering using the cola package.

library(cola)
mat = adjust_matrix(mat)
res = consensus_partition(mat, top_value_method = "ATC", partition_method = "skmeans", mc.cores = 4, verbose = FALSE)

The best number of subgroups is 3:

select_partition_number(res)

collect_plots(res)

## * plotting empirical cumulative distribution curves of the consensus matrix.

## * plotting consensus classes for all k.

## 
## All individual plots can be made by following functions:
## - plot_ecdf(object)
## - collect_classes(object)
## - consensus_heatmap(object, k)
## - membership_heatmap(object, k)
## - get_signatures(object, k)

We next extract signature genes from the 3-subgroup classification, i.e. those genes show significant difference.

sig_df = get_signatures(res, k = 3)

## * 71/71 samples (in 3 classes) remain after filtering by silhouette (>= 0.5).
## * cache hash: b5d8a39422d05a6d478d7b234eef009c (seed 888).
## * calculating row difference between subgroups by Ftest.
## * split rows into 3 groups by k-means clustering.
## * 19151 signatures (48.8%) under fdr < 0.05, group_diff > 0.
##   - randomly sample 2000 signatures.
## * making heatmaps for signatures.

dim(sig_df)

## [1] 19151    12

We use the top 1000 most differential signature genes.

top_rows = sig_df$which_row[order(sig_df$fdr)[1:1000]]
m_top = mat[top_rows, ]

Gene IDs are Ensemble IDs with versions encoded. Wer remove the Ensembl versions:

rownames(m_top) = gsub("\\.\\d+$", "", rownames(m_top))

Next we process the GTEx datasets. The dataset is available from https://www.gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression. Here we use the expression on the gene level (the value can be TPM or RPKM for different versions of the dataset). The file GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct can be found from the website.

The complete dataset is very huge, thus we use the data.table package to read this huge table. Also note the expression matrix is saved in the .gct format which looks like:

1000    10
name    description sample1 sample2 ...
gene1   desc1       1.4     4.1     ...
gene2   desc2       3.2     1.9     ...
...

where data is available from the second line. The first column are gene IDs and the second column are gene descriptions.

library(data.table)
gtex_all = fread("datasets/GTEx_Analysis_v6_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct", skip = 1, header = TRUE)
gtex_all = as.data.frame(gtex_all)
rownames(gtex_all) = gtex_all[, 1]
gtex_all = gtex_all[, -(1:2)]
gtex_all = as.matrix(gtex_all)
dim(gtex_all)

## [1] 56318  8555

gtex_all[1:4, 1:4]

##                   GTEX-111CU-1826-SM-5GZYN GTEX-111FC-0226-SM-5N9B8 GTEX-111VG-2326-SM-5N9BK GTEX-111YS-2426-SM-5GZZQ
## ENSG00000223972.4                  0.00000                 0.000000               0.00000000                 0.000000
## ENSG00000227232.4                  4.93505                 7.774427               5.95150328                 5.864877
## ENSG00000243485.2                  0.00000                 0.000000               0.01695928                 0.000000
## ENSG00000237613.2                  0.00000                 0.000000               0.00000000                 0.000000

In the GTEx datasets, columns are samples from different tissues. The tissue information can be found in the file GTEx_Data_V6_Annotations_SampleAttributesDS.txt also on the GTEx website. Just be sure you choose the same version as the expression data.

anno = fread("datasets/GTEx_Data_V6_Annotations_SampleAttributesDS.txt")
head(anno)

##                      SAMPID SMATSSCR SMCENTER                                                    SMPTHNTS SMRIN           SMTS                  SMTSD
##                      <char>    <int>   <char>                                                      <char> <num>         <char>                 <char>
## 1: GTEX-1117F-0003-SM-58Q7G       NA       B1                                                                NA          Blood            Whole Blood
## 2: GTEX-1117F-0003-SM-5DWSB       NA       B1                                                                NA          Blood            Whole Blood
## 3: GTEX-1117F-0226-SM-5GZZ7        0       B1                2 pieces, ~15% vessel stroma, rep delineated   6.8 Adipose Tissue Adipose - Subcutaneous
## 4: GTEX-1117F-0426-SM-5EGHI        0       B1   2 pieces, !5% fibrous connective tissue, delineated (rep)   7.1         Muscle      Muscle - Skeletal
## 5: GTEX-1117F-0526-SM-5EGHJ        0       B1 2 pieces, clean, Monckebeg medial sclerosis, rep delineated   8.0   Blood Vessel        Artery - Tibial
## 6: GTEX-1117F-0626-SM-5N9CS        1       B1    2 pieces, up to 4mm aderent fat/nerve/vessel, delineated   6.9   Blood Vessel      Artery - Coronary
##    SMUBRID SMTSISCH SMTSPAX   SMTSTPTREF SMNABTCH                                              SMNABTCHT SMNABTCHD         SMGEBTCH SMGEBTCHD
##     <char>    <int>   <int>       <char>   <char>                                                 <char>    <char>           <char>    <char>
## 1:   13756       NA      NA Actual Death BP-38516    DNA isolation_Whole Blood _QIAGEN Puregene (Manual)    5/2/13       LCSET-4574          
## 2:   13756       NA      NA Actual Death BP-38516    DNA isolation_Whole Blood _QIAGEN Puregene (Manual)    5/2/13 GTEx_OM25_Dec_01          
## 3:    2190     1214    1125 Actual Death BP-43693 RNA Extraction from Paxgene-derived Lysate Plate Based   9/17/13       LCSET-4804    3/5/14
## 4:   11907     1220    1119 Actual Death BP-43495 RNA Extraction from Paxgene-derived Lysate Plate Based   9/12/13       LCSET-4764    2/8/14
## 5:    7610     1221    1120 Actual Death BP-43495 RNA Extraction from Paxgene-derived Lysate Plate Based   9/12/13       LCSET-4764    2/8/14
## 6:    1621     1243    1098 Actual Death BP-43956 RNA Extraction from Paxgene-derived Lysate Plate Based   9/25/13       LCSET-4904   3/22/14
##     SMGEBTCHT SMAFRZE  SMGTC  SME2MPRT SMCHMPRS  SMNTRART SMNUMGPS   SMMAPRT  SMEXNCRT  SM550NRM SMGNSDTC  SMUNMPRT  SM350NRM SMRDLGTH  SMMNCPB
##        <char>  <char> <char>     <num>    <int>     <num>    <int>     <num>     <num>     <num>    <int>     <num>     <num>    <int>    <num>
## 1:             USE ME               NA       NA        NA       NA        NA        NA        NA       NA        NA        NA       NA       NA
## 2:             USE ME               NA       NA        NA       NA        NA        NA        NA       NA        NA        NA       NA       NA
## 3: TrueSeq.v1                       NA       NA        NA       NA        NA        NA        NA       NA        NA        NA       NA       NA
## 4: TrueSeq.v1  USE ME        0.9077578    60162 0.9822124      953 0.9270139 0.8969035 0.3622489    22148 0.7231962 0.6877921       76 46.48661
## 5: TrueSeq.v1                       NA       NA        NA       NA        NA        NA        NA       NA        NA        NA       NA       NA
## 6: TrueSeq.v1  USE ME        0.9305153    99793 0.9772871      766 0.9369855 0.8351900 0.3121224    25155 0.8186886 0.7671875       76 50.34656
##       SME1MMRT SMSFLGTH SMESTLBS   SMMPPD   SMNTERRT SMRRNANM  SMRDTTL SMVQCFL    SMMNCV SMTRSCPT SMMPPDPR SMCGLGTH   SMGAPPCT SMUNPDRD   SMNTRNRT
##          <num>    <int>    <int>    <int>      <num>    <int>    <int>   <int>     <num>    <int>    <int>    <int>      <num>    <int>      <num>
## 1:          NA       NA       NA       NA         NA       NA       NA      NA        NA       NA       NA       NA         NA       NA         NA
## 2:          NA       NA       NA       NA         NA       NA       NA      NA        NA       NA       NA       NA         NA       NA         NA
## 3:          NA       NA       NA       NA         NA       NA       NA      NA        NA       NA       NA       NA         NA       NA         NA
## 4: 0.002466065      191 42552266 56021766 0.01759680   524010 60432502 9592528 0.6146461   132276 26923668    72980 0.04292628        0 0.08530895
## 5:          NA       NA       NA       NA         NA       NA       NA      NA        NA       NA       NA       NA         NA       NA         NA
## 6: 0.002827843      196 84748070 65897178 0.02247888   499616 70328908 3629138 0.5960037   144188 31815484    52882 0.02993701        0 0.14209704
##     SMMPUNRT  SMEXPEFF SMMPPDUN    SME2MMRT SME2ANTI SMALTALG SME2SNSE SMMFLGTH SME1ANTI SMSPLTRD    SMBSMMRT SME1SNSE SME1PCTS    SMRRNART  SME1MPRT
##        <num>     <num>    <int>       <num>    <int>    <int>    <int>    <int>    <int>    <int>       <num>    <int>    <num>       <num>     <num>
## 1:        NA        NA       NA          NA       NA       NA       NA       NA       NA       NA          NA       NA       NA          NA        NA
## 2:        NA        NA       NA          NA       NA       NA       NA       NA       NA       NA          NA       NA       NA          NA        NA
## 3:        NA        NA       NA          NA       NA       NA       NA       NA       NA       NA          NA       NA       NA          NA        NA
## 4: 0.6704128 0.8314419 40514728 0.003035173 11741699  2623072 11861992      155 12207544 10849322 0.002744708 12393839 50.37863 0.008670996 0.9462699
## 5:        NA        NA       NA          NA       NA       NA       NA       NA       NA       NA          NA       NA       NA          NA        NA
## 6: 0.7670994 0.7825610 53949267 0.002766269 13758926  2622233 13856138      175 13910044 11122995 0.002797269 14086179 50.31457 0.007103992 0.9434559
##    SMNUM5CD  SMDPMPRT SME2PCTS
##       <int>     <num>    <num>
## 1:       NA        NA       NA
## 2:       NA        NA       NA
## 3:       NA        NA       NA
## 4:      837 0.2768038 50.25481
## 5:       NA        NA       NA
## 6:      847 0.1813114 50.17601

Sample IDs are in the column "SAMPID", tissues are in the column "SMTSD". We generate a list where each element vector contains sample IDs for the current tissue.

tissue = split(anno$SAMPID, anno$SMTSD)
tissue = tissue[names(tissue) != ""]

We create a tissue-level matrix where we take the mean of all samples in the same tissue.

gtex_tissue = do.call(cbind, lapply(tissue, function(sid) {
    l = colnames(gtex_all) %in% sid
    if(any(l)) {
        rowMeans(as.matrix(gtex_all[, l, drop = FALSE]))
    } else {
        NULL
    }
}))
head(gtex_tissue)

##                   Adipose - Subcutaneous Adipose - Visceral (Omentum) Adrenal Gland Artery - Aorta Artery - Coronary Artery - Tibial     Bladder
## ENSG00000223972.4           5.253178e-03                 0.0090625334  0.0112713420   0.0064913781      0.0107213848    0.0025270310 0.009174342
## ENSG00000227232.4           7.220298e+00                 6.6113298517  5.3892013747   6.2221548275      6.8373841881    6.1572479254 9.124309843
## ENSG00000243485.2           5.987153e-03                 0.0143823031  0.0168087167   0.0038272891      0.0063473426    0.0027649712 0.014222227
## ENSG00000237613.2           6.880974e-05                 0.0000000000  0.0000000000   0.0002061127      0.0001000160    0.0000000000 0.000000000
## ENSG00000268020.2           0.000000e+00                 0.0000000000  0.0005654169   0.0000000000      0.0000000000    0.0000000000 0.000000000
## ENSG00000240361.1           4.251723e-04                 0.0003575944  0.0008263901   0.0000000000      0.0001607869    0.0001117398 0.000000000
##                   Brain - Amygdala Brain - Anterior cingulate cortex (BA24) Brain - Caudate (basal ganglia) Brain - Cerebellar Hemisphere
## ENSG00000223972.4     0.0015381076                              0.006466732                    0.0049652994                  1.020534e-02
## ENSG00000227232.4     4.4771507829                              4.780258255                    4.7070904960                  1.067803e+01
## ENSG00000243485.2     0.0434607493                              0.081009161                    0.1143135497                  6.284581e-02
## ENSG00000237613.2     0.0000000000                              0.000152052                    0.0001364206                  1.592348e-04
## ENSG00000268020.2     0.0000000000                              0.000000000                    0.0005929666                  1.171120e-03
## ENSG00000240361.1     0.0002844613                              0.001084933                    0.0012884173                  2.077301e-03
##                   Brain - Cerebellum Brain - Cortex Brain - Frontal Cortex (BA9) Brain - Hippocampus Brain - Hypothalamus
## ENSG00000223972.4       9.988845e-03   0.0057787124                 0.0052937919        0.0058332837         0.0040452055
## ENSG00000227232.4       1.110469e+01   6.4691433133                 5.3707596075        5.2369453844         5.0220341956
## ENSG00000243485.2       6.058131e-02   0.0914839173                 0.1111886339        0.0678335284         0.0758117489
## ENSG00000237613.2       4.710533e-04   0.0005013382                 0.0000000000        0.0002816565         0.0003935297
## ENSG00000268020.2       3.971910e-04   0.0026061366                 0.0005621474        0.0009445946         0.0000000000
## ENSG00000240361.1       1.044311e-03   0.0024923371                 0.0015372000        0.0009807468         0.0007002680
##                   Brain - Nucleus accumbens (basal ganglia) Brain - Putamen (basal ganglia) Brain - Spinal cord (cervical c-1)
## ENSG00000223972.4                              0.0024486807                    6.627654e-03                        0.006138114
## ENSG00000227232.4                              4.7181628658                    4.753972e+00                        5.559372029
## ENSG00000243485.2                              0.1221440886                    8.159263e-02                        0.024226765
## ENSG00000237613.2                              0.0009151911                    2.136495e-04                        0.000000000
## ENSG00000268020.2                              0.0001624539                    1.231059e-03                        0.000000000
## ENSG00000240361.1                              0.0008321162                    6.202426e-05                        0.000000000
##                   Brain - Substantia nigra Breast - Mammary Tissue Cells - EBV-transformed lymphocytes Cells - Transformed fibroblasts
## ENSG00000223972.4             0.0053347573            0.0057532282                        0.0047575278                    0.0081929467
## ENSG00000227232.4             4.8443744656            8.7444986395                        4.9503412954                    2.7582751613
## ENSG00000243485.2             0.0614059333            0.0157085338                        0.0100771667                    0.0147034637
## ENSG00000237613.2             0.0002439327            0.0000000000                        0.0002101376                    0.0001305975
## ENSG00000268020.2             0.0000000000            0.0007699958                        0.0000000000                    0.0000000000
## ENSG00000240361.1             0.0007122993            0.0003011201                        0.0007571487                    0.0000000000
##                   Cervix - Ectocervix Cervix - Endocervix Colon - Sigmoid Colon - Transverse Esophagus - Gastroesophageal Junction Esophagus - Mucosa
## ENSG00000223972.4          0.00000000          0.00000000    0.0124047535       0.0091541217                          0.0078852115       0.0071462263
## ENSG00000227232.4          8.84794680         11.42745714    8.4947213583       7.6692119521                          7.8145282845       5.7180962992
## ENSG00000243485.2          0.00276221          0.01850351    0.0055805320       0.0106361849                          0.0029880885       0.0152353365
## ENSG00000237613.2          0.00000000          0.00000000    0.0002243245       0.0000000000                          0.0005747543       0.0000000000
## ENSG00000268020.2          0.00000000          0.00000000    0.0000000000       0.0000000000                          0.0006709509       0.0001310969
## ENSG00000240361.1          0.00000000          0.00000000    0.0001426429       0.0001207968                          0.0002993776       0.0000000000
##                   Esophagus - Muscularis Fallopian Tube Heart - Atrial Appendage Heart - Left Ventricle Kidney - Cortex       Liver         Lung
## ENSG00000223972.4           0.0070279223    0.007864416             0.0060780580           0.0060784869    0.0047466215 0.007395833 0.0108328749
## ENSG00000227232.4           7.2928423360   10.684090376             3.7089290557           2.7618992454    8.2604877576 4.143892196 7.4718997844
## ENSG00000243485.2           0.0039556336    0.024118946             0.0087300958           0.0083390744    0.0586427259 0.004982706 0.0129206349
## ENSG00000237613.2           0.0002546990    0.003106603             0.0000000000           0.0001617615    0.0009675223 0.000000000 0.0001277256
## ENSG00000268020.2           0.0004870794    0.000000000             0.0001857831           0.0000000000    0.0009298800 0.000000000 0.0001281817
## ENSG00000240361.1           0.0003724650    0.008822924             0.0000000000           0.0001516797    0.0012037038 0.001079129 0.0001193418
##                   Minor Salivary Gland Muscle - Skeletal Nerve - Tibial        Ovary     Pancreas    Pituitary     Prostate
## ENSG00000223972.4         0.0060952563      5.209350e-03   1.059927e-02 1.163470e-02 0.0046040329 0.0226078829 1.425530e-02
## ENSG00000227232.4         6.8734334143      3.678154e+00   1.107068e+01 1.086559e+01 5.1066366364 9.6139178415 1.404610e+01
## ENSG00000243485.2         0.0329565241      1.239938e-02   1.303253e-02 1.010755e-02 0.0097909416 0.0359244353 1.660130e-02
## ENSG00000237613.2         0.0000000000      4.316206e-05   1.745498e-04 0.000000e+00 0.0000000000 0.0006272839 0.000000e+00
## ENSG00000268020.2         0.0000000000      4.784921e-05   7.058344e-04 6.998187e-04 0.0000000000 0.0001739266 3.956626e-04
## ENSG00000240361.1         0.0003554187      3.733143e-04   7.349525e-04 3.840165e-04 0.0006101509 0.0004967933 1.993045e-04
##                   Skin - Not Sun Exposed (Suprapubic) Skin - Sun Exposed (Lower leg) Small Intestine - Terminal Ileum       Spleen      Stomach
## ENSG00000223972.4                        5.997308e-03                   7.855048e-03                     0.0150996786 1.904407e-02 0.0103461162
## ENSG00000227232.4                        1.183911e+01                   1.201444e+01                     9.1280199723 1.103765e+01 6.7696150226
## ENSG00000243485.2                        4.636276e-02                   7.460037e-02                     0.0122550662 2.004678e-02 0.0152012707
## ENSG00000237613.2                        1.991219e-04                   0.000000e+00                     0.0001077178 3.305415e-04 0.0000000000
## ENSG00000268020.2                        0.000000e+00                   5.816538e-04                     0.0011596937 2.781299e-04 0.0000000000
## ENSG00000240361.1                        3.416113e-04                   3.468491e-04                     0.0000000000 1.009679e-03 0.0004508922
##                        Testis      Thyroid       Uterus       Vagina  Whole Blood
## ENSG00000223972.4 0.258113407 1.140511e-02 1.134091e-02 0.0121569559 0.0276289464
## ENSG00000227232.4 9.725626413 1.062283e+01 1.118985e+01 9.0894905354 6.7398894221
## ENSG00000243485.2 0.369860908 2.485452e-02 1.454852e-02 0.0115614171 0.0070784566
## ENSG00000237613.2 0.003918590 2.288932e-04 0.000000e+00 0.0004805979 0.0006237906
## ENSG00000268020.2 0.003313427 7.205905e-04 5.081389e-04 0.0000000000 0.0000000000
## ENSG00000240361.1 0.022096007 6.010640e-04 1.028935e-03 0.0002685573 0.0010037953

We also need to remove the Ensembl versions.

rownames(gtex_tissue) = gsub("\\.\\d+$", "", rownames(gtex_tissue))

Since the GTEx dataset is RPKM, we also perform the log2-transformation.

gtex_tissue = log2(gtex_tissue + 1)

Now we can integrate the GTEx dataset with our dataset. We adjust m_top and gtex_tissue to let the same rows correspond to the same genes in the two matrices.

cn = intersect(rownames(m_top), rownames(gtex_tissue))

m_top = m_top[cn, ]
gtex_tissue = gtex_tissue[cn, ]

Data is ready. To link the two sources of datasets, we simply concatenate the two heatmaps.

library(ComplexHeatmap)
Heatmap(m_top) + Heatmap(gtex_tissue, show_row_names = FALSE)

The following figure is the original analysis also integrating TCGA datasets.

Tissue Contamination

Zuguang Gu z.gu@dkfz.de

2025-06-25

Purpose

Question

Strategy

Analysis