We compare the top \(n\) (\(n\) = 500, 1000, 2000) rows scored by four methods: SD (standard deviation), CV (coefficient of variation), MAD (median absolute deviation) and ATC (ability to correlate to other rows), on five datasets: the Golub leukemia dataset, the Ritz ALL dataset, the TCGA GBM microarray dataset, the HSMM single-cell RNASeq dataset and the MCF10CA single-cell RNASeq dataset. For each dataset, we visualize the top rows (in other words, the top genes) with heatmaps (where rows are scaled by the \(z\)-score method) and an Euler diagram.
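As a rough illustration of how rows can be ranked by a top-value method, the following Python sketch scores each row by SD, CV or MAD and keeps the \(n\) highest-scoring rows. It is not the code used for the analysis: the function names and the toy matrix `expr` are hypothetical, ATC is omitted because it requires the correlations between all rows, and CV is computed as plain standard deviation divided by mean, assuming every row has a positive mean.

```python
import statistics

def sd_score(row):
    # Sample standard deviation (n - 1 denominator).
    return statistics.stdev(row)

def cv_score(row):
    # Coefficient of variation: standard deviation divided by the mean.
    # Assumption: the row mean is positive.
    return statistics.stdev(row) / statistics.mean(row)

def mad_score(row):
    # Median absolute deviation, unscaled. A constant scaling factor
    # (such as the one R's mad() applies by default) would not change the ranking.
    med = statistics.median(row)
    return statistics.median([abs(x - med) for x in row])

def top_rows(matrix, score, n):
    # Indices of the n rows with the highest score, best first.
    scores = [score(row) for row in matrix]
    order = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return order[:n]

def zscore(row):
    # Row-wise z-score scaling, as applied before drawing the heatmaps.
    m = statistics.mean(row)
    s = statistics.stdev(row)
    return [(x - m) / s for x in row]

# Hypothetical toy matrix: 4 genes (rows) x 4 samples (columns).
expr = [
    [1.0, 1.0, 1.0, 10.0],       # one outlying sample
    [100.0, 102.0, 98.0, 100.0], # high mean, small spread
    [5.0, 9.0, 1.0, 5.0],        # moderate mean, large spread
    [0.1, 0.5, 0.1, 0.5],        # low mean, small absolute spread
]

print(top_rows(expr, sd_score, 2))   # [0, 2]
print(top_rows(expr, cv_score, 2))   # [0, 3]
print(top_rows(expr, mad_score, 2))  # [2, 1]
```

Even on this toy matrix the three methods select different rows: CV favors the lowly expressed row, while MAD ignores the row whose variation comes from a single outlying sample.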
We compare the significant GO terms (Biological Process ontology, FDR < 0.01) for the top genes (only the top 1000 genes, for simplicity) from the four top-value methods.
The following heatmap visualizes the clustering of the GO terms. The green-white heatmap illustrates whether a GO term is significant for the gene list extracted by the corresponding top-value method. The word cloud annotation summarizes the keywords of the functions in each GO cluster.
We calculate the uniqueness of the top 1000 rows (genes) selected by the four top-value methods for the GDS cohort (206 datasets) and the recount2 cohort (223 datasets). The uniqueness of a top-value method is the fraction of its top 1000 genes that are not among the top 1000 genes of any of the other three methods. E.g., for one dataset, the uniqueness of the ATC method is calculated as:
\[\frac{| S_{ATC} \setminus (S_{SD} \cup S_{CV} \cup S_{MAD}) |}{1000}\]
where \(S_{ATC}\), \(S_{SD}\), \(S_{CV}\) and \(S_{MAD}\) are the sets of top 1000 genes under each method.
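The uniqueness defined above can be sketched with plain set operations. This is a minimal Python illustration, not the code used for the analysis: the function name `uniqueness` and the toy gene sets (four genes per method instead of 1000) are hypothetical.

```python
def uniqueness(top_sets, method):
    # Fraction of the genes selected by `method` that are not selected
    # by any of the other methods in `top_sets`.
    own = top_sets[method]
    others = set().union(*(s for m, s in top_sets.items() if m != method))
    return len(own - others) / len(own)

# Hypothetical top gene sets for one dataset (4 genes each instead of 1000).
top_sets = {
    "ATC": {"a", "b", "c", "d"},
    "SD":  {"d", "e", "f", "g"},
    "CV":  {"e", "f", "h", "i"},
    "MAD": {"d", "e", "g", "j"},
}

for method in top_sets:
    print(method, uniqueness(top_sets, method))
# ATC 0.75
# SD 0.0
# CV 0.5
# MAD 0.25
```

Dividing by the size of the method's own set is equivalent to dividing by 1000 when every method contributes exactly 1000 genes.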
In the following boxplots (Figure S2.16), we find that for the GDS datasets, which are microarray datasets, the ATC method generally extracts top genes that are more unique than those of the other three methods (mean uniqueness, SD: 0.131, CV: 0.295, MAD: 0.252, ATC: 0.851). For the recount2 datasets, which are RNASeq datasets, the CV method also extracts a large fraction of unique top genes (mean uniqueness, SD: 0.139, CV: 0.672, MAD: 0.259, ATC: 0.743). Since RNASeq can measure genes with very low expression (according to the recount2 pipeline), we suspect that the high fraction of CV-unique top genes is due to the lowly expressed genes in the recount2 datasets (recall that CV is defined as the standard deviation divided by the mean, so a small mean can give a large CV value).
Next we check the base mean of the top genes (the base mean is the mean absolute expression level of a gene). To make the base mean comparable among datasets, the base mean values are replaced by their ranks normalized by the total number of genes in that dataset (calculated as rank(base_mean)/length(base_mean)). For each top-value method in each dataset, the mean normalized rank of the top 1000 genes is used to measure the average base expression level.
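The rank normalization can be sketched in Python as follows. This is an illustration only: the function name `normalized_rank` and the toy values are hypothetical, and tied values are assumed to receive their average rank, which is the default behavior of R's rank().

```python
import statistics

def normalized_rank(values):
    # Rank of each value (1 = smallest) divided by the number of values.
    # Assumption: ties receive their average rank.
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / n for r in ranks]

# Hypothetical base means for 5 genes in one dataset.
base_mean = [8.0, 2.0, 5.0, 2.0, 11.0]
rel_rank = normalized_rank(base_mean)
print(rel_rank)  # [0.8, 0.3, 0.6, 0.3, 1.0]

# Average base expression level of a hypothetical set of top genes (indices 1 and 3).
top_genes = [1, 3]
print(statistics.mean(rel_rank[g] for g in top_genes))
```

A mean normalized rank close to 0 indicates that a method tends to select lowly expressed genes, and a value close to 1 indicates highly expressed genes.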
In the boxplots in Figure S2.17, we see very clearly that in the recount2 datasets, the top 1000 genes selected by the CV method have much lower expression than the top 1000 genes selected by the other top-value methods.
Figure S2.18 visualizes the pairwise overlap between methods, averaged over the datasets in a cohort, where we can also see the uniqueness of ATC. For \(n\) datasets in a cohort, the mean overlap between methods \(i\) and \(j\) is defined as:
\[\frac{1}{n}\sum_{k=1}^{n} p_{i,j,k}\]
where \(p_{i,j,k}\) is the overlap between methods \(i\) and \(j\) in dataset \(k\), defined as:
\[p_{i,j,k} = \frac{| S_{i,k} \cap S_{j,k} |}{1000}\]
where \(S_{i,k}\) and \(S_{j,k}\) are the sets of top 1000 genes extracted by methods \(i\) and \(j\) in dataset \(k\).
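The mean overlap can be computed directly from the per-dataset top gene sets. The following Python sketch is illustrative only: the function name `mean_overlap` and the two-dataset toy cohort (four genes per method instead of 1000) are hypothetical.

```python
def mean_overlap(cohort, i, j, top_n):
    # Average over datasets of |S_i ∩ S_j| / top_n, where `cohort` is a list
    # of dicts mapping a method name to its set of top genes in one dataset.
    overlaps = [len(d[i] & d[j]) / top_n for d in cohort]
    return sum(overlaps) / len(overlaps)

# Hypothetical cohort of two datasets with top_n = 4 genes per method.
cohort = [
    {"SD": {"a", "b", "c", "d"}, "MAD": {"a", "b", "c", "e"}, "ATC": {"x", "y", "z", "a"}},
    {"SD": {"a", "b", "c", "d"}, "MAD": {"a", "b", "f", "g"}, "ATC": {"p", "q", "r", "s"}},
]

print(mean_overlap(cohort, "SD", "MAD", 4))  # 0.625
print(mean_overlap(cohort, "SD", "ATC", 4))  # 0.125
```

The measure is symmetric in \(i\) and \(j\), so only one triangle of the method-by-method matrix needs to be computed.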