Compare Top-value Methods

Golub leukemia dataset

top n = 500

Figure S2.1A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset.

Figure S2.2A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset.

top n = 1000

Figure S2.1B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset.

Figure S2.2B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset.

top n = 2000

Figure S2.1C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset.

Figure S2.2C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Golub leukemia dataset.

We compared the significant GO terms (Biological Process ontology, FDR < 0.01) among the top genes selected by the four top-value methods (only the top 1000 genes, for simplicity).

The following heatmap visualizes the clustering of the GO terms. The green-white heatmap shows whether each GO term is significant for the gene list extracted by the corresponding top-value method. The word cloud annotation summarizes the keywords of the functions in each GO cluster.

Figure S2.3. Heatmap of GO clusters and the summaries, Golub leukemia dataset.

Ritz ALL dataset

top n = 500

Figure S2.4A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset.

Figure S2.5A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset.

top n = 1000

Figure S2.4B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset.

Figure S2.5B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset.

top n = 2000

Figure S2.4C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset.

Figure S2.5C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, Ritz ALL dataset.

We compared the significant GO terms (Biological Process ontology, FDR < 0.01) among the top genes selected by the four top-value methods (only the top 1000 genes, for simplicity).

The following heatmap visualizes the clustering of the GO terms. The green-white heatmap shows whether each GO term is significant for the gene list extracted by the corresponding top-value method. The word cloud annotation summarizes the keywords of the functions in each GO cluster.

Figure S2.6. Heatmap of GO clusters and the summaries, Ritz ALL dataset.

TCGA GBM microarray dataset

top n = 500

Figure S2.7A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset.

Figure S2.8A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset.

top n = 1000

Figure S2.7B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset.

Figure S2.8B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset.

top n = 2000

Figure S2.7C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset.

Figure S2.8C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, TCGA GBM microarray dataset.

We compared the significant GO terms (Biological Process ontology, FDR < 0.01) among the top genes selected by the four top-value methods (only the top 1000 genes, for simplicity).

The following heatmap visualizes the clustering of the GO terms. The green-white heatmap shows whether each GO term is significant for the gene list extracted by the corresponding top-value method. The word cloud annotation summarizes the keywords of the functions in each GO cluster.

Figure S2.9. Heatmap of GO clusters and the summaries, TCGA GBM microarray dataset.

HSMM single cell RNASeq dataset

top n = 500

Figure S2.10A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset.

Figure S2.11A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset.

top n = 1000

Figure S2.10B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset.

Figure S2.11B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset.

top n = 2000

Figure S2.10C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset.

Figure S2.11C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, HSMM single cell RNASeq dataset.

We compared the significant GO terms (Biological Process ontology, FDR < 0.01) among the top genes selected by the four top-value methods (only the top 1000 genes, for simplicity).

The following heatmap visualizes the clustering of the GO terms. The green-white heatmap shows whether each GO term is significant for the gene list extracted by the corresponding top-value method. The word cloud annotation summarizes the keywords of the functions in each GO cluster.

Figure S2.12. Heatmap of GO clusters and the summaries, HSMM single cell RNASeq dataset.

MCF10CA single cell RNASeq dataset

top n = 500

Figure S2.13A. Heatmap of the top 500 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset.

Figure S2.14A. Euler diagram of the top 500 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset.

top n = 1000

Figure S2.13B. Heatmap of the top 1000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset.

Figure S2.14B. Euler diagram of the top 1000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset.

top n = 2000

Figure S2.13C. Heatmap of the top 2000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset.

Figure S2.14C. Euler diagram of the top 2000 genes with the highest SD, CV, MAD and ATC scores, MCF10CA single cell RNASeq dataset.

We compared the significant GO terms (Biological Process ontology, FDR < 0.01) among the top genes selected by the four top-value methods (only the top 1000 genes, for simplicity).

The following heatmap visualizes the clustering of the GO terms. The green-white heatmap shows whether each GO term is significant for the gene list extracted by the corresponding top-value method. The word cloud annotation summarizes the keywords of the functions in each GO cluster.

Figure S2.15. Heatmap of GO clusters and the summaries, MCF10CA single cell RNASeq dataset.

Uniqueness of the top rows

We calculate the uniqueness of the top 1000 rows (genes) selected by each of the four top-value methods for the GDS cohort (206 datasets) and the recount2 cohort (223 datasets). The uniqueness of a top-value method is the fraction of its top 1000 genes that are not among the top 1000 genes of any of the other three methods. For example, for one dataset, the uniqueness of the ATC method is calculated as:

\[\frac{| S_{ATC} \setminus (S_{SD} \cup S_{CV} \cup S_{MAD}) |}{1000}\]

where \(S_{ATC}\), \(S_{SD}\), \(S_{CV}\) and \(S_{MAD}\) are the sets of top 1000 genes under each method.
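The uniqueness defined above can be sketched with plain set operations. This is a minimal illustration, not the code used for the analysis: the function name `uniqueness`, the dictionary layout and the toy gene IDs are assumptions made for the example, and each toy set holds 4 genes instead of 1000.

```python
def uniqueness(top_sets, method):
    """Fraction of `method`'s top genes that appear in no other method's top set."""
    own = top_sets[method]
    # Union of the top sets of all the other methods.
    others = set().union(*(s for name, s in top_sets.items() if name != method))
    return len(own - others) / len(own)


# Hypothetical toy data: 4 top genes per method instead of 1000.
top_sets = {
    "SD": {"g1", "g2", "g3", "g4"},
    "CV": {"g2", "g3", "g5", "g6"},
    "MAD": {"g1", "g2", "g3", "g7"},
    "ATC": {"g3", "g8", "g9", "g10"},
}

uniq = {method: uniqueness(top_sets, method) for method in top_sets}
print(uniq)
```

In this toy example, three of the four ATC genes are found by no other method, so ATC has the highest uniqueness.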

In the following boxplots (Figure S2.16), we find that for the GDS datasets, which are microarray datasets, the ATC method generally extracts top genes that are more unique than those of the other three methods (mean uniqueness, SD: 0.131, CV: 0.295, MAD: 0.252, ATC: 0.851). For the recount2 datasets, which are RNASeq datasets, the CV method also extracts a large fraction of unique top genes (mean uniqueness, SD: 0.139, CV: 0.672, MAD: 0.259, ATC: 0.743). Since RNASeq can measure genes with very low expression (according to the recount2 pipeline), we suspect that the high fraction of CV-unique top genes is due to lowly expressed genes in the recount2 datasets (recall that CV is defined as the standard deviation divided by the mean, so a small mean can give a large CV value).
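The effect of a small mean on CV can be illustrated with two made-up expression profiles. The values and the helper name `cv` below are assumptions for illustration only and are not taken from the GDS or recount2 data.

```python
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Hypothetical profiles: one highly expressed gene, one barely detected gene.
high_expr = [100.0, 110.0, 90.0, 105.0, 95.0]
low_expr = [0.0, 0.2, 0.0, 0.1, 0.0]

sd_high, sd_low = stdev(high_expr), stdev(low_expr)
cv_high, cv_low = cv(high_expr), cv(low_expr)

# SD ranks the highly expressed gene first, while CV ranks the
# lowly expressed gene first because its mean is close to zero.
print(sd_high > sd_low, cv_low > cv_high)
```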

Figure S2.16. Uniqueness of top-value methods in GDS datasets and recount2 datasets.

Next we check the base mean of the top genes (the base mean is the mean absolute expression level of a gene). To make the base mean comparable among datasets, the base mean values are replaced by their ranks normalized by the total number of genes in that dataset (calculated as rank(base_mean)/length(base_mean)). For each top-value method in each dataset, the mean normalized rank of the top 1000 genes is used to measure the average base expression level.
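A minimal sketch of this rank normalization follows. The function names, the toy values and the index lists are assumptions for illustration. Ties are broken by position here, which is a simplification: R's `rank()` averages tied ranks by default.

```python
def normalized_rank(base_mean):
    """Rank each value (1 = lowest) and divide by the number of genes.

    Ties are broken by position (a simplification of average-rank ties).
    """
    order = sorted(range(len(base_mean)), key=lambda i: base_mean[i])
    ranks = [0] * len(base_mean)
    for position, gene_index in enumerate(order, start=1):
        ranks[gene_index] = position
    return [r / len(base_mean) for r in ranks]


def mean_rank_of_top(base_mean, top_indices):
    """Mean normalized rank of the genes selected by one top-value method."""
    norm = normalized_rank(base_mean)
    return sum(norm[i] for i in top_indices) / len(top_indices)


# Hypothetical dataset with 5 genes and two "top" selections of 2 genes each.
base_mean = [5.0, 0.1, 9.0, 2.0, 7.0]
low_level = mean_rank_of_top(base_mean, [1, 3])   # lowly expressed genes
high_level = mean_rank_of_top(base_mean, [2, 4])  # highly expressed genes
print(low_level, high_level)
```

A value near 0 means the selected genes sit at the low end of the expression range in that dataset, and a value near 1 means they sit at the high end.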

The boxplots in Figure S2.17 show clearly that, in the recount2 datasets, the top 1000 genes by the CV method have much lower expression than the top 1000 genes by the other top-value methods.

Figure S2.17. Base expression level of top 1000 genes in GDS and recount2 datasets.

Figure S2.18 visualizes the pairwise overlap between methods, averaged over the datasets in each cohort, where the uniqueness of ATC is also visible. For \(n\) datasets in a cohort, the mean overlap between methods \(i\) and \(j\) is defined as:

\[\frac{1}{n}\sum_{k=1}^{n}{p_{i,j,k}}\]

where \(p_{i,j,k}\) is the overlap between methods \(i\) and \(j\) in dataset \(k\) and is defined as:

\[p_{i,j,k} = \frac{| S_{i,k} \cap S_{j,k} |}{1000}\]

where \(S_{i,k}\) and \(S_{j,k}\) are the sets of top 1000 genes extracted by methods \(i\) and \(j\) in dataset \(k\).
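The mean overlap can be sketched as below. The function name `mean_overlap`, the `n_top` parameter and the toy sets are assumptions for illustration; the toy sets hold 4 genes per method, so `n_top` is 4 rather than 1000.

```python
def mean_overlap(datasets, method_i, method_j, n_top):
    """Average over datasets k of |S_i,k intersect S_j,k| / n_top."""
    overlaps = [len(d[method_i] & d[method_j]) / n_top for d in datasets]
    return sum(overlaps) / len(overlaps)


# Hypothetical cohort of two datasets, each mapping a method to its top genes.
datasets = [
    {"SD": {"a", "b", "c", "d"}, "MAD": {"a", "b", "c", "e"}, "ATC": {"d", "x", "y", "z"}},
    {"SD": {"a", "b", "c", "d"}, "MAD": {"a", "b", "f", "g"}, "ATC": {"p", "q", "r", "s"}},
]

sd_mad = mean_overlap(datasets, "SD", "MAD", n_top=4)
sd_atc = mean_overlap(datasets, "SD", "ATC", n_top=4)
print(sd_mad, sd_atc)
```

Computing this value for every pair of methods gives a symmetric matrix with ones on the diagonal, which is the kind of summary a figure like S2.18 can display.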

Figure S2.18. The average pair-wise overlap among datasets.

Zuguang Gu (z.gu@dkfz.de)

2020-09-01
