A gene set is simply a collection of genes. In R, a collection of gene sets is commonly represented in one of three ways.

  1. A list of gene vectors
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)
lt
## $geneset1
## [1] "gene1" "gene2" "gene3"
## 
## $geneset2
## [1] "gene2" "gene4"
## 
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
  2. A two-column data frame
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6")
)
df
##    geneset  gene
## 1 geneset1 gene1
## 2 geneset1 gene2
## 3 geneset1 gene3
## 4 geneset2 gene2
## 5 geneset2 gene4
## 6 geneset3 gene1
## 7 geneset3 gene3
## 8 geneset3 gene5
## 9 geneset3 gene6

Some tools may need genes to be in the first column:

df[, 2:1]
##    gene  geneset
## 1 gene1 geneset1
## 2 gene2 geneset1
## 3 gene3 geneset1
## 4 gene2 geneset2
## 5 gene4 geneset2
## 6 gene1 geneset3
## 7 gene3 geneset3
## 8 gene5 geneset3
## 9 gene6 geneset3

These two formats are easily converted into each other:

  • list to data frame
data.frame(
    geneset = rep(names(lt), times = sapply(lt, length)),
    gene = unlist(lt)
)
##            geneset  gene
## geneset11 geneset1 gene1
## geneset12 geneset1 gene2
## geneset13 geneset1 gene3
## geneset21 geneset2 gene2
## geneset22 geneset2 gene4
## geneset31 geneset3 gene1
## geneset32 geneset3 gene3
## geneset33 geneset3 gene5
## geneset34 geneset3 gene6
  • data frame to list
split(df$gene, df$geneset)
## $geneset1
## [1] "gene1" "gene2" "gene3"
## 
## $geneset2
## [1] "gene2" "gene4"
## 
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
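
The list-to-data-frame conversion above keeps the names generated by unlist() as row names (geneset11, geneset12, and so on). The following is a minimal sketch of a variant that drops them and then checks the round trip. It rebuilds the same example list so that it runs on its own; the names df2 and lt2 are only used for illustration.

```r
# same example list as above
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)

# lengths() gives the size of each gene set; use.names = FALSE
# stops unlist() from generating row names such as "geneset11"
df2 = data.frame(
    geneset = rep(names(lt), times = lengths(lt)),
    gene = unlist(lt, use.names = FALSE),
    stringsAsFactors = FALSE
)

# converting back and comparing with the original list
lt2 = split(df2$gene, df2$geneset)
all.equal(lt2, lt)
```

Note that split() orders the resulting list by the sorted gene set names, so the round trip only preserves the order of the gene sets when their names are already sorted, as they are here.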

In the GSEAtraining package, there are two helper functions list_to_dataframe() and dataframe_to_list() that do the conversions:

library(GSEAtraining)
list_to_dataframe(lt)
##   gene_set  gene
## 1 geneset1 gene1
## 2 geneset1 gene2
## 3 geneset1 gene3
## 4 geneset2 gene2
## 5 geneset2 gene4
## 6 geneset3 gene1
## 7 geneset3 gene3
## 8 geneset3 gene5
## 9 geneset3 gene6
dataframe_to_list(df)
## $geneset1
## [1] "gene1" "gene2" "gene3"
## 
## $geneset2
## [1] "gene2" "gene4"
## 
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
  3. A binary matrix

Less commonly, the relation between genes and gene sets is represented as a binary matrix:

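# one row per gene set, one column per gene
gene_sets = unique(df$geneset)
genes = unique(df$gene)
m = matrix(0, nrow = length(gene_sets), ncol = length(genes))
rownames(m) = gene_sets
colnames(m) = genes

# mark each (gene set, gene) pair listed in df with 1
for(i in seq_len(nrow(df))) {
    m[df[i, 1], df[i, 2]] = 1
}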
m
##          gene1 gene2 gene3 gene4 gene5 gene6
## geneset1     1     1     1     0     0     0
## geneset2     0     1     0     1     0     0
## geneset3     1     0     1     0     1     1
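
The loop above can also be replaced by a single indexed assignment. The sketch below assumes the same two-column data frame as before (rebuilt here so the block runs on its own) and uses the illustrative name m2 for the result. It relies on base R matrix indexing with a two-column character matrix, where each row of the index selects one cell by row name and column name.

```r
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4",
             "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

gene_sets = unique(df$geneset)
genes = unique(df$gene)
m2 = matrix(0, nrow = length(gene_sets), ncol = length(genes),
    dimnames = list(gene_sets, genes))

# each row of the index matrix is a (row name, column name) pair
m2[cbind(df$geneset, df$gene)] = 1
m2
```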

Note that when there are many gene sets, m will contain many more zeros. In this case, we can use a sparse matrix format to store the data.
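
As one possible illustration of a sparse representation, the sketch below uses sparseMatrix() from the Matrix package. That choice of package is an assumption made here, not something the text above prescribes. The data frame is rebuilt so the block runs on its own, and the names gs, g and sm are illustrative.

```r
library(Matrix)

df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4",
             "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

# integer codes for the rows (gene sets) and columns (genes)
gs = factor(df$geneset, levels = unique(df$geneset))
g = factor(df$gene, levels = unique(df$gene))

# only the non-zero cells are stored
sm = sparseMatrix(
    i = as.integer(gs),
    j = as.integer(g),
    x = rep(1, nrow(df)),
    dims = c(nlevels(gs), nlevels(g)),
    dimnames = list(levels(gs), levels(g))
)
sm
```

Calling as.matrix(sm) converts the result back to an ordinary dense matrix. sparseMatrix() adds up the values of repeated (i, j) pairs, so a data frame with duplicated rows should be deduplicated first, for example with unique(df), to keep the matrix binary.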