For row \(i\) of a matrix, let \(X\) denote the absolute values of the correlation coefficients between row \(i\) and all other rows, and let \(x\) denote a value of \(X\). The ATC score for row \(i\) is defined as:
\[ ATC_i = 1 - \int_0^1F_X(x)dx \]
where \(F_X(x)\) is the cumulative distribution function (CDF) of \(X\).
Figure S1.1 illustrates the empirical CDF (eCDF) curve of \(X\) for a given row \(i\). The ATC score corresponds to the red area above the eCDF curve. When row \(i\) correlates more strongly with other rows (positively or negatively), the eCDF curve shifts further to the right and the ATC score is higher.
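The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the function names `atc_score` and `atc_scores` are hypothetical, and Pearson correlation (via `numpy.corrcoef`) is assumed. The eCDF is a step function, so the integral is computed exactly from its breakpoints.

```python
import numpy as np

def atc_score(abs_corr):
    """Area above the eCDF of `abs_corr` on [0, 1] (hypothetical helper)."""
    x = np.sort(np.asarray(abs_corr, dtype=float))
    n = x.size
    edges = np.concatenate(([0.0], x, [1.0]))  # breakpoints of the step function
    heights = np.arange(n + 1) / n             # eCDF value on each interval
    area_under = np.sum(heights * np.diff(edges))
    return 1.0 - area_under

def atc_scores(mat):
    """ATC score for every row of `mat` (rows are features, columns are samples)."""
    corr = np.abs(np.corrcoef(mat))            # |Pearson correlation| between rows
    n = corr.shape[0]
    # For row i, drop the correlation with itself and keep all other rows.
    return np.array([atc_score(np.delete(corr[i], i)) for i in range(n)])

# Example: three absolute correlations; the result equals their mean (0.5)
# up to floating-point rounding.
print(atc_score([0.2, 0.4, 0.9]))
```

Because \(X\) lies in \([0, 1]\), the area above the eCDF equals the mean of the absolute correlations, which gives a quick check on the numerical result.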
A simulation test is performed to demonstrate the properties of the ATC method. A matrix with 160 rows and 100 columns is generated, with random values drawn from a multivariate normal distribution. The 160 rows are configured as follows:
The first plot in Figure S1.2 is the heatmap of the random matrix, split by the three groups of rows. The second plot shows the eCDF curves of the absolute correlations for the 160 rows. The third plot shows the ATC scores of all 160 rows and the fourth plot shows their standard deviations. Different colors represent different row groups.
All 160 rows have a similar variance of about 1, so they cannot be distinguished by variance. In contrast, the rows with non-zero covariance (the red and green groups) have higher ATC scores. The ATC scores are even higher when the number of correlated rows increases, even though the correlation value itself is relatively small (the green group). This shows that the ATC method assigns higher scores to rows that correlate with more other rows.
Since the eCDF curve increases monotonically from 0 to 1, the area above it is larger where the curve is lower. Consequently, among intervals of a fixed width, those closer to 0 contribute more to the ATC score (Figure S1.3).
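This can be made explicit with a short derivation that uses only the definition of the ATC score. The symbols \(a\), \(w\) and \(A(a)\) are introduced here for illustration: \(A(a)\) is the contribution of the interval \([a, a+w]\) to the area above the curve.
\[ A(a) = \int_a^{a+w}\bigl(1 - F_X(x)\bigr)dx \]
Because \(F_X\) is non-decreasing, \(1 - F_X(x)\) is non-increasing in \(x\), so \(A(a) \ge A(a')\) whenever \(a \le a'\). An interval of width \(w\) near 0 therefore never contributes less than an interval of the same width further to the right.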
There can be scenarios where a huge number of rows correlate with each other, but only with very small correlation values. Compared with the scenario where rows are completely uncorrelated, this results in a small right-shift of the eCDF curves for these rows. Since the correlation values are close to zero, where intervals contribute most, these small shifts result in a relatively large increase in ATC scores.
To reduce this effect, the ATC definition can be modified as follows. For row \(i\) of the matrix, let \(X\) denote the absolute values of the correlation coefficients between row \(i\) and all other rows, and let \(Y = X^\beta\), with \(y\) denoting a value of \(Y\). The ATC score for row \(i\) is then defined as:
\[ ATC_i = (1-\alpha) - \int_\alpha^1F_Y(y)dy \]
where \(F_Y(y)\) is the CDF of \(Y\). Now \(ATC_i\) is the red area above the eCDF curve to the right of \(y = \alpha\) only. The exponent \(\beta\) is a power applied to the absolute correlations; since they lie between 0 and 1, a power greater than 1 shrinks smaller correlations proportionally more. By default, \(\alpha\) is set to 0 and \(\beta\) is set to 1.
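The modified definition can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original implementation: the function name `atc_score_modified` is hypothetical, and the input is assumed to be the vector of absolute correlations of one row to all other rows. Values of \(Y\) below \(\alpha\) are clipped to \(\alpha\), so they only raise the starting height of the eCDF without adding any width to the integral.

```python
import numpy as np

def atc_score_modified(abs_corr, alpha=0.0, beta=1.0):
    """(1 - alpha) minus the integral of the eCDF of Y = X**beta over [alpha, 1]."""
    y = np.sort(np.asarray(abs_corr, dtype=float) ** beta)
    n = y.size
    # Breakpoints of the step function restricted to [alpha, 1].
    edges = np.concatenate(([alpha], np.clip(y, alpha, 1.0), [1.0]))
    heights = np.arange(n + 1) / n      # eCDF value on each interval
    area_under = np.sum(heights * np.diff(edges))
    return (1.0 - alpha) - area_under

x = [0.1, 0.5, 0.9]                     # made-up absolute correlations
print(atc_score_modified(x))            # defaults: same as the original ATC
print(atc_score_modified(x, alpha=0.3)) # ignores the area left of 0.3
print(atc_score_modified(x, beta=2))    # squares the correlations first
```

With the defaults the function reduces to the original definition. More generally the result equals the mean of \(\max(Y - \alpha, 0)\), which makes clear that correlations with \(X^\beta \le \alpha\) contribute nothing to the score.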
We slightly change the previous simulation: the first group now contains 500 rows with a pairwise correlation of 0.1, which gives 560 rows in total. As before, four plots are shown in Figure S1.4.
In Figure S1.4, the rows in group 1 (the black dots) gain high ATC scores because there are so many of them, even though their correlation values are tiny (compare them with the rows in group 2, the red dots).
To remove the effect of small correlations, \(\alpha\) can be set to a value larger than 0, e.g. 0.3:
In Figure S1.5, the ATC scores for rows in group 1 are now lower.
We can also set \(\beta\) to a value larger than 1, e.g. 3, to decrease the scores for rows in group 1, as shown in Figure S1.6.
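The effect of \(\alpha\) and \(\beta\) can be reproduced qualitatively with a small simulation. This is a sketch under assumptions and does not replicate the figures: only the 500 weakly correlated rows (pairwise correlation 0.1) and the 100 columns come from the text, while the second group (10 rows with pairwise correlation 0.8), the shared-factor construction, the random seed, and the function names `atc` and `correlated_rows` are hypothetical choices. The score uses the closed form: the mean of \(\max(|r|^\beta - \alpha, 0)\) over all other rows, which equals the area above the eCDF to the right of \(\alpha\).

```python
import numpy as np

def atc(mat, alpha=0.0, beta=1.0):
    """Modified ATC score for every row of `mat` (closed form of the integral)."""
    y = np.abs(np.corrcoef(mat)) ** beta
    n = y.shape[0]
    excess = np.clip(y - alpha, 0.0, None)
    np.fill_diagonal(excess, 0.0)       # drop each row's correlation with itself
    return excess.sum(axis=1) / (n - 1)

def correlated_rows(rng, n_rows, n_cols, rho):
    """Unit-variance rows with pairwise correlation rho (shared-factor construction)."""
    shared = rng.standard_normal(n_cols)
    noise = rng.standard_normal((n_rows, n_cols))
    return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise

rng = np.random.default_rng(0)
weak = correlated_rows(rng, 500, 100, 0.1)    # many rows, tiny correlation
strong = correlated_rows(rng, 10, 100, 0.8)   # few rows, strong correlation (assumed)
mat = np.vstack([weak, strong])

for alpha, beta in [(0.0, 1.0), (0.3, 1.0), (0.0, 3.0)]:
    s = atc(mat, alpha, beta)
    # Mean score of the weak group, then of the strong group.
    print(alpha, beta, s[:500].mean(), s[500:].mean())
```

With the defaults, the large weakly correlated group should score higher on average than the small strongly correlated group. Setting \(\alpha = 0.3\) or \(\beta = 3\) should reverse that ordering.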