Topic 1-00: Data representation of gene sets in R
Zuguang Gu z.gu@dkfz.de
2025-05-31
Source: vignettes/topic1_00_format.Rmd
Each gene set is a collection of genes. There are three ways to represent gene sets in R.
- A list of gene vectors
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)
lt
## $geneset1
## [1] "gene1" "gene2" "gene3"
##
## $geneset2
## [1] "gene2" "gene4"
##
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
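Typical queries on this representation are one-liners in base R. The following is a minimal sketch on the same example data; the variable names `set_sizes`, `all_genes` and `sets_with_gene2` are only illustrative.

```r
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)

# number of genes in each gene set
set_sizes = lengths(lt)

# all distinct genes covered by the gene sets
all_genes = unique(unlist(lt))

# names of the gene sets that contain a given gene
sets_with_gene2 = names(lt)[sapply(lt, function(x) "gene2" %in% x)]
```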
- A two-column data frame
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6")
)
df
##    geneset  gene
## 1 geneset1 gene1
## 2 geneset1 gene2
## 3 geneset1 gene3
## 4 geneset2 gene2
## 5 geneset2 gene4
## 6 geneset3 gene1
## 7 geneset3 gene3
## 8 geneset3 gene5
## 9 geneset3 gene6
Some tools may need genes to be in the first column:
df[, 2:1]
##    gene  geneset
## 1 gene1 geneset1
## 2 gene2 geneset1
## 3 gene3 geneset1
## 4 gene2 geneset2
## 5 gene4 geneset2
## 6 gene1 geneset3
## 7 gene3 geneset3
## 8 gene5 geneset3
## 9 gene6 geneset3
These two formats are easily converted to each other:
- list to data frame
data.frame(
    geneset = rep(names(lt), times = sapply(lt, length)),
    gene = unlist(lt)
)
##            geneset  gene
## geneset11 geneset1 gene1
## geneset12 geneset1 gene2
## geneset13 geneset1 gene3
## geneset21 geneset2 gene2
## geneset22 geneset2 gene4
## geneset31 geneset3 gene1
## geneset32 geneset3 gene3
## geneset33 geneset3 gene5
## geneset34 geneset3 gene6
- data frame to list
split(df$gene, df$geneset)
## $geneset1
## [1] "gene1" "gene2" "gene3"
##
## $geneset2
## [1] "gene2" "gene4"
##
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
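The two conversions above can be combined into a round-trip check. This is a sketch that assumes gene set names are unique. It uses `lengths()` and `unlist(use.names = FALSE)`, base R alternatives that avoid the `geneset11`-style row names; the names `df2`, `lt2` and `round_trip_ok` are only illustrative.

```r
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)

# list -> data frame, one row per (gene set, gene) pair
df2 = data.frame(
    geneset = rep(names(lt), times = lengths(lt)),
    gene = unlist(lt, use.names = FALSE),
    stringsAsFactors = FALSE
)

# data frame -> list
lt2 = split(df2$gene, df2$geneset)

# TRUE if the round trip recovers the original list
round_trip_ok = identical(lt, lt2)
```

Note that `split()` orders its result by the sorted gene set names, so when the names are not already in sorted order the round trip may return the gene sets in a different order.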
In the GSEAtopics package, two helper functions,
list_to_data_frame() and data_frame_to_list(),
do the conversions:
list_to_data_frame(lt)
##   gene_set  gene
## 1 geneset1 gene1
## 2 geneset1 gene2
## 3 geneset1 gene3
## 4 geneset2 gene2
## 5 geneset2 gene4
## 6 geneset3 gene1
## 7 geneset3 gene3
## 8 geneset3 gene5
## 9 geneset3 gene6
data_frame_to_list() automatically guesses which column
contains the gene sets and which contains the genes.
data_frame_to_list(df)
## $geneset1
## [1] "gene1" "gene2" "gene3"
##
## $geneset2
## [1] "gene2" "gene4"
##
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
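How data_frame_to_list() guesses the columns is not described here. One plausible heuristic, stated purely as an assumption and not as the package's implementation, is that the gene set column has fewer distinct values than the gene column. The helper name `guess_and_split()` is hypothetical.

```r
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

# hypothetical helper, not part of any package
guess_and_split = function(df) {
    # count distinct values in each of the two columns
    n_unique = sapply(df[, 1:2], function(x) length(unique(x)))
    # assumption: the column with fewer distinct values holds the gene sets
    set_col = unname(which.min(n_unique))
    gene_col = setdiff(1:2, set_col)
    split(as.character(df[[gene_col]]), as.character(df[[set_col]]))
}

# the result does not depend on the column order
res1 = guess_and_split(df)
res2 = guess_and_split(df[, 2:1])
```

Such a heuristic fails when there are more gene sets than genes, or when both columns have the same number of distinct values, so selecting the columns by name is safer in those cases.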
- A binary matrix
Less commonly, the relation between genes and gene sets is represented as a binary matrix:
m = matrix(0, nrow = 3, ncol = 6)
rownames(m) = unique(df$geneset)
colnames(m) = unique(df$gene)
for(i in seq_len(nrow(df))) {
    m[df[i, 1], df[i, 2]] = 1
}
m
##          gene1 gene2 gene3 gene4 gene5 gene6
## geneset1     1     1     1     0     0     0
## geneset2     0     1     0     1     0     0
## geneset3     1     0     1     0     1     1
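The loop above can also be written without hard-coded dimensions and without iterating over rows. The following sketch relies on base R's support for indexing a matrix with a two-column character matrix of (row name, column name) pairs; the names `sets` and `genes` are only illustrative.

```r
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

sets = unique(df$geneset)
genes = unique(df$gene)

# dimensions are derived from the data instead of being fixed to 3 x 6
m = matrix(0, nrow = length(sets), ncol = length(genes),
    dimnames = list(sets, genes))

# each row of the index matrix addresses one cell by its row and column name
m[cbind(df$geneset, df$gene)] = 1
```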
The matrix format is more useful when genes carry weights:
##               gene1     gene2     gene3       gene4     gene5    gene6
## geneset1 0.08075014 0.8343330 0.6007609 0.000000000 0.0000000 0.000000
## geneset2 0.00000000 0.1572084 0.0000000 0.007399441 0.0000000 0.000000
## geneset3 0.46639350 0.0000000 0.4977774 0.000000000 0.2897672 0.732882
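The code that generated these weights is not shown. As a sketch, assuming the weights are random numbers drawn with `runif()` (an assumption, so the values will differ from the output above), the non-zero entries of a binary matrix can be replaced as follows. The names `w` and `w1` and the seed are arbitrary.

```r
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)

# binary membership matrix: gene sets in rows, genes in columns
genes = unique(unlist(lt))
m = t(sapply(lt, function(x) as.numeric(genes %in% x)))
colnames(m) = genes

set.seed(1)  # arbitrary seed so that the sketch is reproducible

# assumed weighting scheme: one uniform random weight per member gene
w = m
w[m == 1] = runif(sum(m == 1))

# weights of the genes that belong to one gene set
w1 = w["geneset1", ]
w1 = w1[w1 > 0]
```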
Note that with many gene sets, m will consist mostly of zeros.
In that case, a sparse matrix format can be used to store
the data.
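One way to do this is with the Matrix package, which is distributed with R as a recommended package. The sketch below builds the same gene set by gene matrix with `sparseMatrix()`, which stores only the non-zero cells; the names `sets`, `genes` and `sm` are only illustrative.

```r
library(Matrix)

df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

sets = unique(df$geneset)
genes = unique(df$gene)

# i and j are the row and column indices of the non-zero cells, x their values
sm = sparseMatrix(
    i = match(df$geneset, sets),
    j = match(df$gene, genes),
    x = rep(1, nrow(df)),
    dims = c(length(sets), length(genes)),
    dimnames = list(sets, genes)
)

# number of stored non-zero entries, one per (gene set, gene) pair
n_nonzero = nnzero(sm)
```

`as.matrix(sm)` converts the object back to an ordinary dense matrix when needed.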