vignettes/topic1_00_format.Rmd
A gene set is a collection of genes. In R, a collection of gene sets is commonly represented in one of three ways: as a named list, as a two-column data frame, or as a binary matrix.
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)
lt
## $geneset1
## [1] "gene1" "gene2" "gene3"
##
## $geneset2
## [1] "gene2" "gene4"
##
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6")
)
df
## geneset gene
## 1 geneset1 gene1
## 2 geneset1 gene2
## 3 geneset1 gene3
## 4 geneset2 gene2
## 5 geneset2 gene4
## 6 geneset3 gene1
## 7 geneset3 gene3
## 8 geneset3 gene5
## 9 geneset3 gene6
Some tools may need genes to be in the first column:
df[, 2:1]
## gene geneset
## 1 gene1 geneset1
## 2 gene2 geneset1
## 3 gene3 geneset1
## 4 gene2 geneset2
## 5 gene4 geneset2
## 6 gene1 geneset3
## 7 gene3 geneset3
## 8 gene5 geneset3
## 9 gene6 geneset3
The list and data frame formats are easily converted to each other:
data.frame(
    geneset = rep(names(lt), times = sapply(lt, length)),
    gene = unlist(lt)
)
## geneset gene
## geneset11 geneset1 gene1
## geneset12 geneset1 gene2
## geneset13 geneset1 gene3
## geneset21 geneset2 gene2
## geneset22 geneset2 gene4
## geneset31 geneset3 gene1
## geneset32 geneset3 gene3
## geneset33 geneset3 gene5
## geneset34 geneset3 gene6
split(df$gene, df$geneset)
## $geneset1
## [1] "gene1" "gene2" "gene3"
##
## $geneset2
## [1] "gene2" "gene4"
##
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
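As a quick sanity check, the two conversions above can be chained into a round trip. The following is only a sketch: `lengths()` and `unlist(..., use.names = FALSE)` are base R alternatives chosen here to avoid the `geneset11`-style row names, and the names `df2` and `lt2` are made up for this example. Note that `split()` orders its result by the sorted group labels, so gene sets may come back in a different order than in the input list.

```r
lt = list(
    geneset1 = c("gene1", "gene2", "gene3"),
    geneset2 = c("gene2", "gene4"),
    geneset3 = c("gene1", "gene3", "gene5", "gene6")
)

# list -> data frame; use.names = FALSE keeps the default row names 1..n
df2 = data.frame(
    geneset = rep(names(lt), times = lengths(lt)),
    gene = unlist(lt, use.names = FALSE),
    stringsAsFactors = FALSE
)

# data frame -> list
lt2 = split(df2$gene, df2$geneset)

# compare after aligning the gene set order with the original list
identical(lt2[names(lt)], lt)
```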
In the GSEAtraining package, there are two helper functions, list_to_dataframe() and dataframe_to_list(), that do the conversions:
library(GSEAtraining)
list_to_dataframe(lt)
## gene_set gene
## 1 geneset1 gene1
## 2 geneset1 gene2
## 3 geneset1 gene3
## 4 geneset2 gene2
## 5 geneset2 gene4
## 6 geneset3 gene1
## 7 geneset3 gene3
## 8 geneset3 gene5
## 9 geneset3 gene6
dataframe_to_list(df)
## $geneset1
## [1] "gene1" "gene2" "gene3"
##
## $geneset2
## [1] "gene2" "gene4"
##
## $geneset3
## [1] "gene1" "gene3" "gene5" "gene6"
Less commonly, the relation between genes and gene sets is represented as a binary matrix, with gene sets as rows and genes as columns:
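# rows are gene sets, columns are genes; start from an all-zero matrix
m = matrix(0, nrow = length(unique(df$geneset)), ncol = length(unique(df$gene)))
rownames(m) = unique(df$geneset)
colnames(m) = unique(df$gene)
# set the cell to 1 for every (gene set, gene) pair in df
for(i in seq_len(nrow(df))) {
    m[df[i, 1], df[i, 2]] = 1
}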
m
## gene1 gene2 gene3 gene4 gene5 gene6
## geneset1 1 1 1 0 0 0
## geneset2 0 1 0 1 0 0
## geneset3 1 0 1 0 1 1
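The loop above can also be written without explicit iteration, and the matrix can be converted back to a data frame. The sketch below is one possible way to do this in base R, not the approach used by any particular package: it relies on indexing a matrix with a two-column character matrix of (row name, column name) pairs, and on `which(..., arr.ind = TRUE)` to recover the positions of the non-zero cells. The names `m2`, `idx` and `df_back` are made up for this example.

```r
df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

genesets = unique(df$geneset)
genes = unique(df$gene)
m2 = matrix(0, nrow = length(genesets), ncol = length(genes),
    dimnames = list(genesets, genes))

# a two-column character matrix selects cells by (row name, column name)
m2[cbind(df$geneset, df$gene)] = 1

# matrix -> data frame: look up the names of the non-zero cells
idx = which(m2 == 1, arr.ind = TRUE)
df_back = data.frame(
    geneset = rownames(m2)[idx[, "row"]],
    gene = colnames(m2)[idx[, "col"]],
    stringsAsFactors = FALSE
)
# which() walks the matrix column by column, so re-sort by gene set
df_back = df_back[order(df_back$geneset, df_back$gene), ]
```

Deriving the dimensions from `unique()` rather than hard-coding them keeps the sketch valid when gene sets or genes are added to `df`.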
Note that as the number of gene sets grows, m will contain many more zeros. In this case, a sparse matrix format can be used to store the data.
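As an illustration of the sparse format, the sketch below builds the same gene set by gene matrix with `sparseMatrix()` from the Matrix package. The choice of Matrix is an assumption made for this example, since the text does not name a specific package, and the variable name `sm` is made up. Only the non-zero cells are stored, given as row indices `i` and column indices `j`.

```r
library(Matrix)

df = data.frame(
    geneset = c(rep("geneset1", 3), rep("geneset2", 2), rep("geneset3", 4)),
    gene = c("gene1", "gene2", "gene3", "gene2", "gene4", "gene1", "gene3", "gene5", "gene6"),
    stringsAsFactors = FALSE
)

genesets = unique(df$geneset)
genes = unique(df$gene)

# i and j are the integer row and column positions of the non-zero cells;
# x = 1 is recycled for all of them
sm = sparseMatrix(
    i = match(df$geneset, genesets),
    j = match(df$gene, genes),
    x = 1,
    dims = c(length(genesets), length(genes)),
    dimnames = list(genesets, genes)
)
sm

# genes that belong to one gene set
colnames(sm)[sm["geneset2", ] != 0]
```

Note that `sparseMatrix()` sums the values of repeated (i, j) pairs, so this sketch assumes that `df` contains no duplicated rows.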