Supplementary file S01. Definition of gene overlap coefficients

In this supplementary file, we briefly introduce the following coefficients, which measure the similarity between two gene sets based on their gene overlap: the Jaccard coefficient, the Dice coefficient, the overlap coefficient and the kappa coefficient.

Denote two sets of genes as \(A\) and \(B\). The Jaccard coefficient between the two sets is calculated as:

\[\frac{|A \cap B|}{|A \cup B|}\]

The Dice coefficient is calculated as:

\[\frac{|A \cap B|}{(|A| + |B|)/2}\]

The Jaccard coefficient is related to the Dice coefficient by:

\[\mathrm{Jaccard} = \frac{\mathrm{Dice}}{2 - \mathrm{Dice}}\]
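
This relationship follows from the identity \(|A \cup B| = |A| + |B| - |A \cap B|\). As a short sketch, write \(J\) and \(D\) for the Jaccard and Dice coefficients (shorthand introduced here only for this derivation) and assume \(|A \cap B| > 0\), so that \(D > 0\). From the definition of the Dice coefficient:

\[|A| + |B| = \frac{2|A \cap B|}{D}\]

Substituting into the Jaccard coefficient gives:

\[J = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} = \frac{|A \cap B|}{\frac{2|A \cap B|}{D} - |A \cap B|} = \frac{1}{\frac{2}{D} - 1} = \frac{D}{2 - D}\]

When the two sets share no genes, both coefficients are zero and the relationship holds trivially.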

The overlap coefficient is calculated as:

\[\frac{| A \cap B | }{\min(|A|,|B|)}\]

The symbol \(|A|\) means the size of set \(A\) (number of elements).
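
The three set-based coefficients above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by any particular package: the function names and the example gene sets are hypothetical, and both sets are assumed to be non-empty (otherwise the denominators are zero).

```python
def jaccard(set_a: set, set_b: set) -> float:
    """|A n B| / |A u B|"""
    return len(set_a & set_b) / len(set_a | set_b)


def dice(set_a: set, set_b: set) -> float:
    """|A n B| / ((|A| + |B|) / 2)"""
    return len(set_a & set_b) / ((len(set_a) + len(set_b)) / 2)


def overlap(set_a: set, set_b: set) -> float:
    """|A n B| / min(|A|, |B|)"""
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical example gene sets (illustration only).
set_a = {"TP53", "BRCA1", "EGFR", "MYC"}
set_b = {"TP53", "EGFR", "KRAS"}

j = jaccard(set_a, set_b)  # 2 shared genes out of 5 in the union
d = dice(set_a, set_b)     # 2 shared genes, mean set size 3.5
o = overlap(set_a, set_b)  # 2 shared genes, smaller set has 3

print(j, d, o)
# The relationship Jaccard = Dice / (2 - Dice) holds numerically.
print(abs(j - d / (2 - d)) < 1e-12)
```

In this example the two sets share 2 genes, so the Jaccard coefficient is 2/5, the Dice coefficient is 4/7 and the overlap coefficient is 2/3.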

The definition of the kappa coefficient is slightly more involved. First, let’s arrange the two sets into a contingency table:

		In set B
		Yes	No
In set A	Yes	a	b
In set A	No	c	d

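where \(a\), \(b\), \(c\) and \(d\) are the numbers of genes that fall into each category; for example, \(a\) is the number of genes in both sets and \(d\) is the number of genes in neither set.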
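
The four counts can be obtained with plain set operations, as in the following sketch. Note that counting the genes in neither set requires a background set of all genes under consideration; the text does not specify how this background is chosen, so the `universe` argument, the function name and the example genes below are assumptions made for illustration.

```python
def contingency_counts(set_a: set, set_b: set, universe: set) -> tuple:
    """Return (a, b, c, d) for two gene sets within a background universe."""
    # Restrict both sets to the universe so that the four counts sum to its size.
    in_a = set_a & universe
    in_b = set_b & universe
    a = len(in_a & in_b)             # in A and in B
    b = len(in_a - in_b)             # in A, not in B
    c = len(in_b - in_a)             # in B, not in A
    d = len(universe - in_a - in_b)  # in neither set
    return a, b, c, d


# Hypothetical background of 10 genes and two example gene sets.
universe = {"TP53", "BRCA1", "EGFR", "MYC", "KRAS",
            "PTEN", "AKT1", "CDK2", "RB1", "ATM"}
set_a = {"TP53", "BRCA1", "EGFR", "MYC"}
set_b = {"TP53", "EGFR", "KRAS"}

a, b, c, d = contingency_counts(set_a, set_b, universe)
print(a, b, c, d)
```

For these example sets the counts are \(a = 2\), \(b = 2\), \(c = 1\) and \(d = 5\), which sum to the 10 genes of the background.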

Let’s calculate \(p_o\) and \(p_e\) as:

\[p_o = \frac{a+d}{a+b+c+d}\]

\[p_{Yes} = \frac{a+b}{a+b+c+d} \cdot \frac{a+c}{a+b+c+d}\]

\[p_{No} = \frac{c+d}{a+b+c+d} \cdot \frac{b+d}{a+b+c+d}\]

\[p_e = p_{Yes} + p_{No}\]

where \(p_o\) is the observed probability that a gene is either in both gene sets or in neither of them, \(p_{Yes}\) is the probability that a gene is in both gene sets by chance (assuming that membership in set \(A\) and membership in set \(B\) are independent events), \(p_{No}\) is the probability that a gene is in neither set by chance, and \(p_e\) is the probability that a gene is either in both sets or in neither set by chance.

The kappa coefficient is then calculated as:

\[\frac{p_o - p_e}{1 - p_e}\]
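
The calculation can be traced step by step with a small Python sketch that takes the four cell counts of the contingency table directly. The function name and the example counts are hypothetical. The sketch assumes that \(p_e < 1\); if all genes fall into a single cell on the diagonal, the denominator is zero and the coefficient is undefined.

```python
def kappa(a: int, b: int, c: int, d: int) -> float:
    """Kappa coefficient from the four cells of the 2x2 contingency table."""
    n = a + b + c + d
    p_o = (a + d) / n                       # observed agreement
    p_yes = ((a + b) / n) * ((a + c) / n)   # both "Yes" by chance
    p_no = ((c + d) / n) * ((b + d) / n)    # both "No" by chance
    p_e = p_yes + p_no                      # expected agreement by chance
    return (p_o - p_e) / (1 - p_e)


# Hypothetical counts: 2 genes in both sets, 2 only in A, 1 only in B, 5 in neither.
print(kappa(2, 2, 1, 5))
```

With these counts, \(p_o = 0.7\), \(p_{Yes} = 0.12\), \(p_{No} = 0.42\) and \(p_e = 0.54\), so the kappa coefficient is \(0.16 / 0.46 \approx 0.348\). Unlike the three set-based coefficients, kappa depends on \(d\) and therefore on the total number of genes considered.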


Zuguang Gu (z.gu@dkfz.de)

2021-11-21