In this supplementary, we compare different top-value methods for extracting features for subgrouping a methylation array dataset. We also compare the consensus partitioning results based on CpG probes in different CpG categories (i.e. CpG islands (CGI), CGI shores and CGI seas).
The 450K methylation array dataset is from Sturm et al., 2012. The dataset is available from the GEO database under accession GSE36278. The processing of the dataset can be found here.
The top 5000 CpG probes with the highest standard deviation (SD), coefficient of variation (CV), median absolute deviation (MAD) and ATC (ability to correlate to other rows) scores are extracted separately for each method. Additionally, CpG probes are also extracted based on two variants of the ATC method:
ATC_cgi_anno: when calculating the ATC score for the ith CpG probe, only the probes that have the same CpG annotation as the ith CpG are used. This method aims to remove the effect of different methylation patterns due to CpGs in different CpG categories.
ATC_SD: this score is calculated as the mean rank of the ranks by ATC scores and by SD scores, i.e. rank(rank(ATC) + rank(SD)). This method aims to integrate two top-value methods into one.
CpG probes are categorized into three groups, for which the annotation is provided by the IlluminaHumanMethylation450kanno.ilmn12.hg19 package. The three CpG categories are CpG islands (CGI), CGI shores and CGI seas.
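The top-value scores and the rank combination described above can be illustrated with a minimal Python sketch. It is not the cola implementation: the toy matrix `beta`, the helper names and the `atc_scores` values are hypothetical, CV is taken as plain SD divided by the mean, MAD is computed without a scaling constant, and rank 1 is assigned to the highest score (the direction of ranking does not change which probes end up on top).

```python
import statistics

def sd(row):
    return statistics.stdev(row)

def cv(row):
    # Coefficient of variation, here simply SD / mean (an assumption).
    return statistics.stdev(row) / statistics.mean(row)

def mad(row):
    # Raw median absolute deviation, no scaling constant.
    med = statistics.median(row)
    return statistics.median([abs(x - med) for x in row])

def top_n(scores, n):
    """Indices of the n rows with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

def rank_desc(scores):
    """Rank 1 = highest score; ties are broken by row order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def combined_rank(scores_a, scores_b):
    """rank(rank(a) + rank(b)), with rank 1 = best combined row."""
    summed = [ra + rb for ra, rb in zip(rank_desc(scores_a), rank_desc(scores_b))]
    # A lower summed rank is better, so negate before ranking.
    return rank_desc([-s for s in summed])

# Toy beta values: rows are probes, columns are samples (hypothetical data).
beta = [
    [0.10, 0.90, 0.15, 0.85],  # row 0: large spread around 0.5
    [0.50, 0.52, 0.49, 0.51],  # row 1: nearly constant
    [0.02, 0.30, 0.03, 0.25],  # row 2: low methylation, moderate spread
    [0.40, 0.60, 0.45, 0.55],  # row 3: small spread around 0.5
]
sd_scores = [sd(r) for r in beta]
cv_scores = [cv(r) for r in beta]
mad_scores = [mad(r) for r in beta]

# Hypothetical ATC scores, only used to show the rank combination.
atc_scores = [0.3, 0.1, 0.9, 0.8]
atc_sd_rank = combined_rank(atc_scores, sd_scores)
```

In this toy example SD and MAD both put row 0 first, while CV puts the low-methylation row 2 first because dividing by a small mean inflates its score.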
Figure S9.1 visualizes the methylation profiles of the top 5000 CpG probes extracted by the different top-value methods. At the bottom of each heatmap is an annotation (dkfz_subtype) from the original study, where the subtypes were predicted from the top 8000 probes with the highest SD scores by k-means consensus clustering. In general, the top 5000 CpG probes by SD, CV and MAD show distinct methylation patterns between subtypes (heatmaps in the first row). CpG probes by SD can separate all six subtypes, which is expected because the original subtypes were also based on probes with top SD scores. Probes by CV can separate the G34, IDH, K27 and RTK II Classic subtypes but not the other two. SD and MAD extract more similar sets of CpG probes (Figure S9.2), while CV tends to extract probes with lower methylation in general.
In comparison, the top 5000 CpG probes extracted by the ATC-related methods have little overlap with the probes selected by SD/CV/MAD, and their methylation profiles are also different (compare the top three heatmaps and the bottom three heatmaps). The top probes by the ATC-related methods contain a very high proportion of CGI sea probes, and they can only separate samples into two major groups (IDH/RTK II Classic/Mesenchymal and G34/K27/RTK I PDGFRA), which are a high-methylation group and a low-methylation group.
CpGs fall into different categories, and different types of CpGs may play different roles in transcriptional regulation. For example, CGIs are enriched at transcription start sites (TSS) and are generally unmethylated for actively expressed genes. An increase in methylation at a gene TSS is generally associated with suppression of the gene's expression. CGI shores have more dynamic methylation patterns, while CGI seas overlap more with gene bodies or intergenic regions and normally relate to methylation changes over long-range regions, which might in turn relate to changes in chromatin structure.
As already shown in Figure S9.1, the proportion of CpGs in each CpG category differs among the top 5000 probes extracted by the different top-value methods (see the annotation on the left of each heatmap in Figure S9.1, also summarized in Figure S9.3B). For example, probes in CGIs have higher SD values, while probes in CGI seas show stronger correlation to each other. Thus, we think it is worthwhile to extract features and apply consensus partitioning to the probes in each CpG category separately.
As shown in Figure S9.3B, only 20% of the top 5000 CpGs by SD belong to CGI seas, while 46.2% of the probes on the complete 450K array are in CGI seas (as visualized in Figure S9.3A, 30.4% of the CpGs on the 450K array belong to CGIs, 23.4% to CGI shores and 46.2% to CGI seas). Thus, simply taking the top CpG features from all probes might miss interesting patterns in CGI shores or CGI seas.
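The imbalance described above, and the per-category selection it motivates, can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the `scores` and `labels` values are hypothetical toy data, and the helper names are not taken from any package.

```python
from collections import Counter, defaultdict

def category_proportions(labels):
    """Fraction of probes falling into each CpG category."""
    counts = Counter(labels)
    return {cat: counts[cat] / len(labels) for cat in counts}

def top_n(scores, n):
    """Indices of the n probes with the highest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]

def top_n_per_category(scores, labels, n):
    """Top n probes by score, selected within each category separately."""
    by_cat = defaultdict(list)
    for i, cat in enumerate(labels):
        by_cat[cat].append(i)
    return {
        cat: sorted(idx, key=lambda i: scores[i], reverse=True)[:n]
        for cat, idx in by_cat.items()
    }

# Hypothetical SD scores and CpG annotations for six probes.
scores = [0.40, 0.35, 0.30, 0.10, 0.08, 0.05]
labels = ["island", "island", "shore", "sea", "sea", "sea"]

global_top = top_n(scores, 2)
global_props = category_proportions([labels[i] for i in global_top])
array_props = category_proportions(labels)
per_category_top = top_n_per_category(scores, labels, 1)
```

Here half of all probes are sea probes, yet the global top 2 contains only island probes; selecting within each category guarantees that shore and sea probes are represented.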
Similar to Figure S9.1, Figure S9.4 visualizes the top 5000 CpGs from the different CpG categories. For CpGs from CGIs, SD gives a very clean picture in which almost all six subtypes can be separated, although some of the RTK I PDGFRA samples are mixed with the Mesenchymal subtype. CV selects probes that nicely separate G34/IDH/RTK II Classic but not the others. MAD selects probes that generate a noisier profile compared to SD. Probes selected by ATC generally show very similar patterns among all samples but, interestingly, some subtypes can still be separated.
From CGI shores to CGI seas, the methylation profiles become noisier and some subtypes can no longer be cleanly separated. ATC now separates samples into the high-methylation and low-methylation groups.
Note that in the discussion of Figure S9.4, when we say that “subtypes can be separated”, we refer to the hierarchical clustering applied to the samples and visualized on top of each heatmap, not to the consensus clustering.
According to the discussion in the main manuscript, the ATC method works very well for gene expression datasets. However, as shown in the previous figures, ATC performs the worst on this methylation dataset. We want to know why the ATC features are so different and what information they can provide.
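To make the behaviour of a correlation-based score concrete, the following Python sketch uses the mean absolute Pearson correlation of a row to the other rows as a simplified stand-in for ATC. This is an assumption for illustration only; the score implemented in the cola package may differ in detail. The optional `labels` argument mimics the ATC_cgi_anno idea of restricting the comparison to probes with the same annotation. All data and names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def atc_like(matrix, labels=None):
    """Mean absolute correlation of each row to the other rows.

    If labels is given, only rows with the same label are compared
    (a sketch of the ATC_cgi_anno variant)."""
    scores = []
    for i, row in enumerate(matrix):
        others = [
            j for j in range(len(matrix))
            if j != i and (labels is None or labels[j] == labels[i])
        ]
        if not others:
            scores.append(0.0)
            continue
        total = sum(abs(pearson(row, matrix[j])) for j in others)
        scores.append(total / len(others))
    return scores

# Hypothetical probes: rows 0-2 vary together, row 3 varies strongly but alone.
beta = [
    [0.1, 0.2, 0.3, 0.4],
    [0.2, 0.4, 0.6, 0.8],
    [0.9, 0.8, 0.7, 0.6],
    [0.1, 0.9, 0.9, 0.1],
]
scores_all = atc_like(beta)
scores_same = atc_like(beta, labels=["sea", "sea", "sea", "island"])
```

Row 3 has by far the largest spread, so an SD-based method would rank it first, yet it receives the lowest correlation-based score. Rows that follow a shared trend across samples score highest regardless of how large that trend is.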
Since ATC is based on correlation, in Figure S9.5 we also scale the methylation values by rows with a z-score transformation (the right heatmap). For the top 5000 CGI probes, although the methylation profiles look very similar among samples, there is a very clear pattern that separates the samples into two groups after the z-score transformation (Figure S9.5A, right heatmap). The two-group classification is identified by simply applying k-means clustering to the scaled methylation profile and is marked as the km annotation under the heatmap.
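The row scaling followed by a two-group k-means step can be sketched in Python as follows. This is a minimal illustration, not the code used for Figure S9.5: the toy matrix is hypothetical, and the k-means routine is a plain Lloyd iteration with a deterministic start chosen here for reproducibility.

```python
import statistics

def zscore_rows(matrix):
    """Scale each row to mean 0 and SD 1 (constant rows become all zeros)."""
    scaled = []
    for row in matrix:
        m = statistics.mean(row)
        s = statistics.stdev(row)
        scaled.append([(x - m) / s if s > 0 else 0.0 for x in row])
    return scaled

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(samples, n_iter=20):
    """Lloyd k-means with k = 2; centre 1 starts at the first sample,
    centre 2 at the sample farthest from it. Returns labels 1 and 2."""
    far = max(range(len(samples)), key=lambda j: sq_dist(samples[j], samples[0]))
    centers = [list(samples[0]), list(samples[far])]
    labels = [0] * len(samples)
    for _ in range(n_iter):
        labels = [
            min((0, 1), key=lambda k: sq_dist(s, centers[k])) for s in samples
        ]
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:
                centers[k] = [statistics.mean(v) for v in zip(*members)]
    return [lab + 1 for lab in labels]

# Hypothetical probes (rows) x samples (columns): absolute levels look alike,
# but samples 0-2 and 3-5 differ by a small, consistent shift in every row.
beta = [
    [0.90, 0.91, 0.89, 0.95, 0.96, 0.94],
    [0.80, 0.79, 0.81, 0.85, 0.86, 0.84],
    [0.70, 0.71, 0.69, 0.66, 0.65, 0.64],
]
scaled = zscore_rows(beta)
samples = [list(col) for col in zip(*scaled)]  # one vector per sample
km = two_means(samples)
```

On the unscaled values the two sample groups differ by only a few percent of methylation, which is hard to see in a heatmap with a global colour scale; after row scaling the same difference spans roughly two standard deviations in every row.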
For the top 5000 CGI shore probes and CGI sea probes, samples can also be separated into two groups after row scaling. Unlike for the CGI probes, the samples show consistently high methylation in one group and consistently low methylation in the other group. These two-group classifications are also marked as annotations under each heatmap.
Comparing the km annotation to the dkfz_subtype annotation, we find that the two-group classification for CGI probes shows no obvious agreement with the subtypes, while for the CGI shores and seas, IDH/Mesenchymal/RTK II Classic samples show relatively high methylation and K27/G34 samples show relatively low methylation.
Since the relative methylation profiles of the CpGs in shores and seas show a single one-sided pattern, we suspect that this is due to a global difference in methylation. Thus we look at the global methylation distribution of all probes in each CpG category. In Figure S9.6, samples are ordered by their median methylation value. The bottom annotations show which km group (the same as in Figure S9.5) and which subtype each sample belongs to.
For the CGI probes, neither the km nor the subtype classification is related to the increasing methylation level among samples (Figure S9.6A), while for the CGI shore/sea probes, samples labelled as km = 1 have systematically higher methylation levels. Samples in different subtypes also show different methylation levels, e.g. IDH samples show the highest methylation and K27/G34 samples show the lowest methylation.
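The comparison above amounts to computing one median per sample over all probes of a category, ordering the samples by that median, and checking how the groups distribute along the ordering. A minimal Python sketch, with a hypothetical toy matrix and hypothetical km labels, could look like this:

```python
import statistics

def sample_medians(matrix):
    """Median methylation of each sample (column) across all probes (rows)."""
    return [statistics.median(col) for col in zip(*matrix)]

def order_by_median(matrix):
    """Sample indices sorted by increasing median methylation."""
    meds = sample_medians(matrix)
    return sorted(range(len(meds)), key=lambda j: meds[j])

def median_by_group(matrix, groups):
    """Median of the per-sample medians within each group label."""
    meds = sample_medians(matrix)
    return {
        g: statistics.median([m for m, lab in zip(meds, groups) if lab == g])
        for g in sorted(set(groups))
    }

# Hypothetical probes (rows) x samples (columns) from one CpG category.
beta = [
    [0.80, 0.30, 0.75, 0.35],
    [0.90, 0.40, 0.85, 0.20],
    [0.70, 0.20, 0.95, 0.25],
]
km = [1, 2, 1, 2]  # hypothetical two-group labels per sample

sample_order = order_by_median(beta)
group_medians = median_by_group(beta, km)
```

If a two-group classification mainly reflects a global shift, as suspected for the shore and sea probes, the samples of one group cluster at one end of this ordering, as they do in this toy example.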
Based on Figures S9.1 and S9.4, we conclude that SD selects better probes for subgroup classification. Thus, we extract the consensus partitioning results by SD:skmeans for the three CpG categories, as well as for all probes taken together. We select skmeans because it generates stable partitions. The predicted classes (the number of subgroups is based on the suggest_best_k() function in the cola package) as well as the dkfz_subtype annotation are visualized in Figure S9.7.
We see that all/island/shore generate similar classifications. A subset of the Mesenchymal samples is classified into the same group as RTK II Classic. Note that dkfz_subtype is based on k-means clustering while ours is based on spherical k-means clustering. Interestingly, for the CGI sea probes, Mesenchymal and RTK II Classic cannot be distinguished, and the K27 subtype merges with some of the RTK I PDGFRA samples while the IDH subtype merges with the remaining RTK I PDGFRA samples.
The HTML reports for the cola analysis can be found at the following links: