Chapter 8 UpSet plot

The UpSet plot provides an efficient way to visualize intersections of multiple sets compared to traditional approaches such as the Venn diagram. It is implemented in the UpSetR package in R. Here we re-implement UpSet plots with the ComplexHeatmap package, with some improvements.

8.1 Input data

Multiple sets can be represented in one of the following formats:

  1. A list of sets where each set is a vector, e.g.:
list(set1 = c("a", "b", "c"),
     set2 = c("b", "c", "d", "e"),
     ...)
  2. A binary matrix/data frame where rows are elements and columns are sets, e.g.:
  set1 set2 set3
h    1    1    1
t    1    0    1
j    1    0    0
u    1    0    1
w    1    0    0
...

E.g., row t means that t is in set1, not in set2, and in set3. Note that a logical matrix is also valid.

If the variable is a data frame, only the binary columns (those containing only 0 and 1) and the logical columns are used.
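The column rule can be mimicked in base R. The sketch below only illustrates the rule as it is described here and is not the package's internal code; the data frame `df` and the helper `is_set_column()` are hypothetical names introduced for this example.

```r
# A toy data frame mixing set columns with other columns
df = data.frame(
    name  = c("h", "t", "j"),
    set1  = c(1, 1, 1),
    set2  = c(1, 0, 0),
    set3  = c(TRUE, TRUE, FALSE),
    score = c(0.3, 2.5, 1),
    stringsAsFactors = FALSE
)

# Hypothetical helper: a column qualifies if it is logical,
# or numeric with only the values 0 and 1
is_set_column = function(x) {
    is.logical(x) || (is.numeric(x) && all(x %in% c(0, 1)))
}

set_cols = names(df)[sapply(df, is_set_column)]
set_cols
```

Here `name` (character) and `score` (numeric but not binary) are skipped, so only `set1`, `set2` and `set3` would be treated as sets.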

Both formats can be used for making UpSet plots. Users can also use list_to_matrix() to convert a list to a binary matrix.

lt = list(set1 = c("a", "b", "c"),
          set2 = c("b", "c", "d", "e"))
list_to_matrix(lt)
##   set1 set2
## a    1    0
## b    1    1
## c    1    1
## d    0    1
## e    0    1
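Conceptually, the conversion is a membership test of every element against every set. A minimal base R sketch of that idea follows; it is not the actual implementation of list_to_matrix(), and the variable names `elements` and `mat` are chosen for this example only.

```r
lt = list(set1 = c("a", "b", "c"),
          set2 = c("b", "c", "d", "e"))

# All distinct elements across the sets become the rows
elements = unique(unlist(lt))

# One column per set: 1 if the element is in that set, 0 otherwise
mat = sapply(lt, function(s) as.integer(elements %in% s))
rownames(mat) = elements
mat
```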

You can also set the universal set in list_to_matrix():

list_to_matrix(lt, universal_set = letters[1:10])
##   set1 set2
## a    1    0
## b    1    1
## c    1    1
## d    0    1
## e    0    1
## f    0    0
## g    0    0
## h    0    0
## i    0    0
## j    0    0
  3. Sets of genomic intervals, which can only be represented as a list of GRanges/IRanges objects, e.g.:
list(set1 = GRanges(...),
     set2 = GRanges(...),
     ...)

8.2 Mode

For example, with three sets (A, B, C), all combinations of being in or not in each set are as follows:

A B C
1 1 1
1 1 0
1 0 1
0 1 1
1 0 0
0 1 0
0 0 1

A value of 1 means that the set is selected and 0 means that it is not. E.g., “1 1 0” means selecting sets A and B but not set C. Note there is no “0 0 0”, because the background set is not of interest here. In the rest of this section, we refer to A, B and C as sets and to each combination as a combination set. The whole binary matrix is called the combination matrix.

The UpSet plot visualizes the size of each combination set. With the binary code of each combination set, next we need to define how to calculate the size of that combination set. There are three modes:

  1. distinct mode: 1 means in that set and 0 means not in that set, so 1 1 0 means the set of elements that are in both A and B but not in C (setdiff(intersect(A, B), C)). Under this mode, the seven combination sets are the seven partitions in the Venn diagram and they are mutually exclusive.

  2. intersect mode: 1 means in that set and 0 is not taken into account, so 1 1 0 means the set of elements that are in both A and B, regardless of whether they are in C (intersect(A, B)). Under this mode, the seven combination sets can overlap.

  3. union mode: 1 means in that set and 0 is not taken into account. When there are multiple 1s, the relationship is OR, so 1 1 0 means the set of elements that are in A or B, regardless of whether they are in C (union(A, B)). Under this mode, the seven combination sets can overlap.
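The three definitions can be written down with base R set operations. The sketch below is an illustration of the definitions only, not how the package computes them; `comb_elements()` and the toy sets are hypothetical names made up for this example.

```r
# Elements of one combination set, given its binary code and the mode
comb_elements = function(sets, code, mode = c("distinct", "intersect", "union")) {
    mode = match.arg(mode)
    selected = strsplit(code, "")[[1]] == "1"
    stopifnot(length(selected) == length(sets), any(selected))
    inside = sets[selected]     # sets coded as 1
    outside = sets[!selected]   # sets coded as 0
    if (mode == "union") {
        return(Reduce(union, inside))
    }
    common = Reduce(intersect, inside)
    if (mode == "distinct") {
        # additionally exclude everything found in a set coded as 0
        common = setdiff(common, unlist(outside))
    }
    common
}

sets = list(A = c("a", "b", "c", "d"),
            B = c("b", "c", "e"),
            C = c("c", "d", "f"))
comb_elements(sets, "110", "distinct")   # "b"
comb_elements(sets, "110", "intersect")  # "b" "c"
comb_elements(sets, "110", "union")      # "a" "b" "c" "d" "e"
```

Only in distinct mode do the sizes of the seven combination sets add up to the number of distinct elements, because only there is each element counted exactly once.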

The three modes are illustrated in the following figure:

8.3 Make the combination matrix

The function make_comb_mat() generates the combination matrix as well as calculates the size of the sets and the combination sets. The input can be one single variable or name-value pairs:

set.seed(123)
lt = list(a = sample(letters, 5),
          b = sample(letters, 10),
          c = sample(letters, 15))
m1 = make_comb_mat(lt)
m1
## A combination matrix with 3 sets and 7 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.
m2 = make_comb_mat(a = lt$a, b = lt$b, c = lt$c)
m3 = make_comb_mat(list_to_matrix(lt))

m1, m2 and m3 are identical.

The mode is controlled by the mode argument:

m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")

The UpSet plots under different modes will be demonstrated in later sections.

When there are too many sets, the sets can be pre-filtered by set size. The min_set_size and top_n_sets arguments serve this purpose: min_set_size controls the minimal size of the sets and top_n_sets controls the number of largest sets to keep.

m1 = make_comb_mat(lt, min_set_size = 4)
m2 = make_comb_mat(lt, top_n_sets = 2)
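In base R terms, this pre-filtering amounts to selecting list elements by their lengths. The sketch below uses a deterministic toy list and assumes that a minimal size is an inclusive threshold (an assumption made for this illustration, not a statement about the package's exact comparison).

```r
lt_toy = list(a = letters[1:5], b = letters[1:10], c = letters[1:15])
sizes = lengths(lt_toy)

# keep sets with at least 6 elements
lt_min = lt_toy[sizes >= 6]

# keep the two largest sets
lt_top = lt_toy[order(sizes, decreasing = TRUE)[1:2]]

names(lt_min)
names(lt_top)
```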

Subsetting the sets affects the calculation of the sizes of the combination sets, which is why it needs to be controlled at the combination matrix generation step. Subsetting the combination sets, in contrast, can be performed directly by subsetting the matrix:

m = make_comb_mat(lt)
m[1:4]
## A combination matrix with 3 sets and 4 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.
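To see why removing a set has to happen before the sizes are computed, consider the distinct mode with three toy sets (the sets below are made up for this illustration). Dropping set C changes which elements count as "in A and B only":

```r
A = c("a", "b", "c")
B = c("b", "c", "d")
C = c("c", "e")

# with all three sets, "110" excludes everything that is also in C
with_C = setdiff(intersect(A, B), C)

# once C is removed, "11" is simply the intersection of A and B
without_C = intersect(A, B)

length(with_C)     # 1
length(without_C)  # 2
```

Removing a combination set, on the other hand, leaves the sizes of the remaining combination sets untouched.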

make_comb_mat() also allows specifying the universal set, so that the complement set, which contains the elements not belonging to any set, is also considered.

m = make_comb_mat(lt, universal_set = letters)
m
## A combination matrix with 3 sets and 8 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
##   universal set is set.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.

The universal set can be smaller than the union of all sets. In that case, for each set, only its intersection with the universal set is considered.

m = make_comb_mat(lt, universal_set = letters[1:10])
m
## A combination matrix with 3 sets and 5 combinations.
##   ranges of combination set size: c(1, 4).
##   mode for the combination size: distinct.
##   sets are on rows.
##   universal set is set.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.
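The effect of a universal set can be sketched with base R set operations. This only illustrates the description above, under the assumption that sets are first restricted to the universal set and the complement is whatever remains; the toy list `lt_toy` is hypothetical.

```r
lt_toy = list(a = c("a", "b", "c"),
              b = c("b", "c", "d", "k"))
universal = letters[1:10]   # "k" lies outside the universal set

# each set is reduced to its intersection with the universal set
restricted = lapply(lt_toy, function(s) intersect(s, universal))

# the complement set: elements of the universal set in none of the sets
complement = setdiff(universal, unlist(restricted))

restricted$b
complement
```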

If you already know the size of the complement set, you can directly set the complement_size argument.

m = make_comb_mat(lt, complement_size = 5)
m
## A combination matrix with 3 sets and 8 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.

When the input is a matrix and it contains elements that do not belong to any of the sets, these elements are treated as the complement set.

x = list_to_matrix(lt, universal_set = letters)
m = make_comb_mat(x)
m
## A combination matrix with 3 sets and 8 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
##   universal set is set.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.

The universal set also works for sets as genomic regions.

8.4 Utility functions

make_comb_mat() returns a matrix that also has the class comb_mat. Several utility functions can be applied to this comb_mat object:

  • set_name(): The set names.
  • comb_name(): The combination set names. The names of the combination sets are formatted as a string of binary bits. E.g. for three sets of A, B, C, the combination set with name “101” corresponds to selecting set A, not selecting set B and selecting set C.
  • set_size(): The set sizes.
  • comb_size(): The combination set sizes.
  • comb_degree(): The degree for a combination set is the number of sets that are selected.
  • t(): Transpose the combination matrix. By default make_comb_mat() generates a matrix where sets are on rows and combination sets are on columns, and so are they on the UpSet plots. By transposing the combination matrix, the positions of sets and combination sets can be switched on the UpSet plot.
  • extract_comb(): Extract the elements in a specified combination set. The usage will be explained later.
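The naming scheme makes it possible to decode a combination set name by hand. The base R sketch below works on plain character strings (it does not touch a comb_mat object) and shows how the degree and the selected sets follow from the binary string; all variable names are chosen for this example.

```r
comb_names = c("100", "010", "001", "110", "101", "011", "111")
set_names = c("A", "B", "C")

bits = strsplit(comb_names, "")

# degree: the number of 1s in the name
degree = sapply(bits, function(b) sum(b == "1"))
names(degree) = comb_names

# selected sets: the sets at the positions of the 1s
selected = lapply(bits, function(b) set_names[b == "1"])
names(selected) = comb_names

degree
selected[["101"]]
```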

Quick examples are:

m = make_comb_mat(lt)
set_name(m)
## [1] "a" "b" "c"
comb_name(m)
## [1] "100" "010" "001" "110" "101" "011" "111"
set_size(m)
##  a  b  c 
##  5 10 15
comb_size(m)
## 100 010 001 110 101 011 111 
##   1   2   8   2   1   5   1
comb_degree(m)
## 100 010 001 110 101 011 111 
##   1   1   1   2   2   2   3
t(m)
## A combination matrix with 3 sets and 7 combinations.
##   ranges of combination set size: c(1, 8).
##   mode for the combination size: distinct.
##   sets are on columns
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.

When using extract_comb(), the combination set name must be one of those returned by comb_name(). Note the elements in the combination sets depend on the mode set in make_comb_mat().

extract_comb(m, "101")
## [1] "h"

Next we demonstrate a second example, where the sets are genomic regions. When the sets are genomic regions, the size is calculated as the sum of the width of regions in each set (or in other words, the total number of base pairs).

library(circlize)
library(GenomicRanges)
lt2 = lapply(1:4, function(i) generateRandomBed())
lt2 = lapply(lt2, function(df) GRanges(seqnames = df[, 1], 
    ranges = IRanges(df[, 2], df[, 3])))
names(lt2) = letters[1:4]
m = make_comb_mat(lt2)
set_size(m)
##          a          b          c          d 
## 1547359832 1553032120 1553276284 1536074343
comb_size(m)
##      1000      0100      0010      0001      1100      1010      1001 
## 194662048 193835937 195511630 189629247 195376990 193348833 191061005 
##      0110      0101      0011      1110      1101      1011      0111 
## 194172744 190845714 194009085 196733799 191029099 188462356 194352135 
##      1111 
## 196685702
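The size of a region set, as the total number of base pairs, can be illustrated without any genomics package. The sketch below assumes closed, 1-based intervals (so a region from 101 to 200 covers 100 bp) and non-overlapping regions; the toy table `regions` is made up for this example.

```r
regions = data.frame(
    chr   = c("chr1", "chr1", "chr2"),
    start = c(101, 501, 1),
    end   = c(200, 650, 1000)
)

# width of each region, counting both end points
width = regions$end - regions$start + 1

# size of the set = total number of base pairs covered
total_bp = sum(width)
total_bp
```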

And now extract_comb() returns genomic regions that are in the corresponding combination set.

extract_comb(m, "1010")
## GRanges object with 4964 ranges and 0 metadata columns:
##          seqnames               ranges strand
##             <Rle>            <IRanges>  <Rle>
##      [1]     chr1   [ 115989,  119463]      *
##      [2]     chr1   [ 840522,  865179]      *
##      [3]     chr1   [1179600, 1204260]      *
##      [4]     chr1   [2593536, 2608935]      *
##      [5]     chr1   [2891740, 2909649]      *
##      ...      ...                  ...    ...
##   [4960]     chrY [54028033, 54047795]      *
##   [4961]     chrY [54524877, 54539441]      *
##   [4962]     chrY [54736433, 54827603]      *
##   [4963]     chrY [57875829, 57921061]      *
##   [4964]     chrY [59054832, 59056095]      *
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

With comb_size() and comb_degree(), we can filter the combination matrix as:

m = make_comb_mat(lt)
# combination set size >= 4
m[comb_size(m) >= 4]
## A combination matrix with 3 sets and 2 combinations.
##   ranges of combination set size: c(5, 8).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.
# combination set degree == 2
m[comb_degree(m) == 2]
## A combination matrix with 3 sets and 3 combinations.
##   ranges of combination set size: c(1, 5).
##   mode for the combination size: distinct.
##   sets are on rows.
## 
## Utility functions that can be applied:
## - set_name(): name of the sets.
## - set_size(): size of the sets.
## - comb_name(): name of the combination sets.
## - comb_size(): size of the combination sets.
## - comb_degree(): degree of the combination sets.
## - extract_comb(): extract elements in the specific combination set.
## - t(): transpose the combination matrix on the UpSet plot.
## - '[': subset the combination matrix.
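Because comb_size() and comb_degree() return vectors of the same length and order, the two conditions can also be combined into one logical index. The sketch below reproduces the values printed earlier in this section as plain named vectors, so it runs without the package; with a real combination matrix the resulting `keep` vector would be used as `m[keep]`.

```r
# values copied from the comb_size(m) and comb_degree(m) output above
size = c("100" = 1, "010" = 2, "001" = 8, "110" = 2,
         "101" = 1, "011" = 5, "111" = 1)
degree = c("100" = 1, "010" = 1, "001" = 1, "110" = 2,
           "101" = 2, "011" = 2, "111" = 3)

# combination sets of at least two elements shared by at least two sets
keep = size >= 2 & degree >= 2
names(size)[keep]
```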

For the complement set, the name for this special combination set is only composed of zeros.

m2 = make_comb_mat(lt, universal_set = letters)
comb_name(m2) # see the first element
## [1] "000" "100" "010" "001" "110" "101" "011" "111"
comb_degree(m2)
## 000 100 010 001 110 101 011 111 
##   0   1   1   1   2   2   2   3

If universal_set was set in make_comb_mat(), extract_comb() can be applied to the complement set.

m2 = make_comb_mat(lt, universal_set = letters)
extract_comb(m2, "000")
## [1] "a" "c" "e" "f" "p" "y"
m2 = make_comb_mat(lt, universal_set = letters[1:10])
extract_comb(m2, "000")
## [1] "a" "c" "e" "f"

When universal_set was set, extract_comb() also works for genomic region sets.

8.5 Make the plot

Making the UpSet plot is straightforward. Users just pass the combination matrix to the UpSet() function:

UpSet(m)

By default the sets are ordered by the size and the combination sets are ordered by the degree (number of sets that are selected).

The order is controlled by set_order and comb_order:

UpSet(m, set_order = c("a", "b", "c"), comb_order = order(comb_size(m)))
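Since comb_order is given as an ordering of the combination sets, base R's order() with several keys is a convenient way to build it. The sketch below uses the sizes and degrees printed in the previous section as plain vectors, and sorts by degree first and by decreasing size within each degree; the same expression on a real matrix would be `order(comb_degree(m), -comb_size(m))`.

```r
# values copied from the comb_size(m) and comb_degree(m) output above
size = c("100" = 1, "010" = 2, "001" = 8, "110" = 2,
         "101" = 1, "011" = 5, "111" = 1)
degree = c(1, 1, 1, 2, 2, 2, 3)

# primary key: degree (increasing); secondary key: size (decreasing)
od = order(degree, -size)
od
names(size)[od]
```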

The color of the dots, the size of the dots and the line width of the segments are controlled by comb_col, pt_size and lwd, respectively. comb_col should be a vector corresponding to the combination sets. In the following code, since comb_degree(m) returns a vector of integers, we simply use it as an index into the color vector.

UpSet(m, pt_size = unit(5, "mm"), lwd = 3,
    comb_col = c("red", "blue", "black")[comb_degree(m)])
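One caveat of indexing a color vector by degree is worth spelling out in plain R: an index of 0 selects nothing, so if a complement set (degree 0) is present, the resulting color vector is one element shorter than the number of combination sets. The sketch below shows this with plain vectors; shifting the index by one and adding a color for degree 0 is one possible workaround, and the choice of "grey" is arbitrary.

```r
cols = c("red", "blue", "black")

# degrees without a complement set
degree = c(1, 1, 1, 2, 2, 2, 3)
cols[degree]

# degrees with a complement set in first position (degree 0)
degree0 = c(0, 1, 1, 1, 2, 2, 2, 3)
length(cols[degree0])            # 7, not 8: the 0 index is dropped

# possible workaround: reserve the first color for degree 0
cols0 = c("grey", cols)[degree0 + 1]
length(cols0)                    # 8
```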

The background colors (for the rectangles and for the dots representing sets that are not selected) are controlled by bg_col and bg_pt_col. bg_col can have a length of one or two.

UpSet(m, comb_col = "#0000FF", bg_col = "#F0F0FF", bg_pt_col = "#CCCCFF")

UpSet(m, comb_col = "#0000FF", bg_col = c("#F0F0FF", "#FFF0F0"), bg_pt_col = "#CCCCFF")

Transposing the combination matrix switches the sets to columns and the combination sets to rows.

UpSet(t(m))

As introduced above, if the combination sets are subsetted, the subset of the matrix can be visualized as well:

UpSet(m[comb_size(m) >= 4])
UpSet(m[comb_degree(m) == 2])

The following compares the different modes in make_comb_mat():

m1 = make_comb_mat(lt) # the default mode is `distinct`
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(m1)
UpSet(m2)
UpSet(m3)

For a plot containing the complement set, there is one additional column showing that this complement set does not overlap with any of the sets (all dots are in grey).

m2 = make_comb_mat(lt, universal_set = letters)
UpSet(m2)

If you already know the size of the complement set, you can directly assign it with the complement_size argument in make_comb_mat().

m2 = make_comb_mat(lt, complement_size = 10)
UpSet(m2)

The following shows the case where the universal set is smaller than the union of all sets:

m2 = make_comb_mat(lt, universal_set = letters[1:10])
UpSet(m2)

By default the empty combination sets are removed from the plot, but they can be kept by setting remove_empty_comb_set = FALSE in make_comb_mat().

m2 = make_comb_mat(lt, universal_set = letters[1:10], remove_empty_comb_set = FALSE)
UpSet(m2)

In some cases you may have a complement set but do not want to show it, especially when the input for make_comb_mat() is a matrix which already contains the complement set.

x = list_to_matrix(lt, universal_set = letters)
m2 = make_comb_mat(x)
UpSet(m2)

m2 = make_comb_mat(x, remove_complement_set = TRUE)
UpSet(m2)

Setting remove_complement_set = TRUE is identical to:

m2 = make_comb_mat(x)
m2 = m2[comb_degree(m2) > 0]

8.6 UpSet plots as heatmaps

In the UpSet plot, the major component is the combination matrix, and on the two sides are the barplots representing the sizes of the sets and of the combination sets. It is therefore quite straightforward to implement it as a “heatmap”, where the heatmap body is self-defined with dots and segments and the two barplots are two barplot annotations constructed by anno_barplot().

The default top annotation is:

HeatmapAnnotation("Intersection\nsize" = anno_barplot(comb_size(m), 
        border = FALSE, gp = gpar(fill = "black"), height = unit(3, "cm")), 
    annotation_name_side = "left", annotation_name_rot = 0)

This top annotation is wrapped in upset_top_annotation(), which only contains the UpSet top barplot annotation. Most of the arguments in upset_top_annotation() go directly to anno_barplot(), e.g. to set the colors of bars:

UpSet(m, top_annotation = upset_top_annotation(m, 
    gp = gpar(col = comb_degree(m))))

To control the data range and axis:

UpSet(m, top_annotation = upset_top_annotation(m, 
    ylim = c(0, 15),
    bar_width = 1,
    axis_param = list(side = "right", at = c(0, 5, 10, 15),
        labels = c("zero", "five", "ten", "fifteen"))))

To control the annotation name:

UpSet(m, top_annotation = upset_top_annotation(m, 
    annotation_name_rot = 90,
    annotation_name_side = "right",
    axis_param = list(side = "right")))

The settings are very similar for the right annotation:

UpSet(m, right_annotation = upset_right_annotation(m, 
    ylim = c(0, 30),
    gp = gpar(fill = "green"),
    annotation_name_side = "top",
    axis_param = list(side = "top")))

upset_top_annotation() and upset_right_annotation() can automatically recognize whether sets are on rows or columns.

upset_top_annotation() and upset_right_annotation() only contain one barplot annotation. If users want to add more annotations, they need to manually construct a HeatmapAnnotation object with multiple annotations.

To add more annotations on top:

UpSet(m, top_annotation = HeatmapAnnotation(
    degree = as.character(comb_degree(m)),
    "Intersection\nsize" = anno_barplot(comb_size(m), 
        border = FALSE, 
        gp = gpar(fill = "black"), 
        height = unit(2, "cm")
    ), 
    annotation_name_side = "left", 
    annotation_name_rot = 0))

To add more annotations on the right:

UpSet(m, right_annotation = rowAnnotation(
    "Set size" = anno_barplot(set_size(m), 
        border = FALSE, 
        gp = gpar(fill = "black"), 
        width = unit(2, "cm")
    ),
    group = c("group1", "group1", "group2")))

To move the right annotation to the left of the combination matrix:

UpSet(m, left_annotation = rowAnnotation(
    "Set size" = anno_barplot(set_size(m), 
        border = FALSE, 
        gp = gpar(fill = "black"), 
        width = unit(2, "cm")
    )), right_annotation = NULL)

To reverse the axis of the left annotation:

UpSet(m, left_annotation = rowAnnotation(
    "Set size" = anno_barplot(set_size(m), 
        axis_param = list(direction = "reverse"),
        border = FALSE, 
        gp = gpar(fill = "black"), 
        width = unit(2, "cm")
    )), right_annotation = NULL,
    row_names_side = "right")

The object returned by UpSet() is actually a Heatmap class object, thus you can add it to other heatmaps and annotations with + or %v%.

ht = UpSet(m)
class(ht)
## [1] "Heatmap"
## attr(,"package")
## [1] "ComplexHeatmap"
ht + Heatmap(1:3, name = "foo", width = unit(5, "mm")) + 
    rowAnnotation(bar = anno_points(1:3))

ht %v% Heatmap(rbind(1:7), name = "foo", row_names_side = "left", 
        height = unit(5, "mm")) %v% 
    HeatmapAnnotation(bar = anno_points(1:7),
        annotation_name_side = "left")

Add multiple UpSet plots:

m1 = make_comb_mat(lt, mode = "distinct")
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(m1, row_title = "distinct mode") %v%
    UpSet(m2, row_title = "intersect mode") %v%
    UpSet(m3, row_title = "union mode")

Or first transpose all the combination matrices and add them horizontally:

m1 = make_comb_mat(lt, mode = "distinct")
m2 = make_comb_mat(lt, mode = "intersect")
m3 = make_comb_mat(lt, mode = "union")
UpSet(t(m1), column_title = "distinct mode") +
    UpSet(t(m2), column_title = "intersect mode") +
    UpSet(t(m3), column_title = "union mode")

The three combination matrices are actually the same and plotting them three times is redundant. With the functionality in ComplexHeatmap package, we can use other types of annotations.

# it is the same with using m1, m2 or m3
ht = UpSet(m1, top_annotation = HeatmapAnnotation(size = anno_lines(
        cbind(comb_size(m1), comb_size(m2), comb_size(m3)),
        gp = gpar(col = 2:4), height = unit(3, "cm")
)))
# you need to manually construct a legend
draw(ht, annotation_legend_list = list(Legend(
        title = "mode",
        type = "lines",
        labels = c("distinct", "intersect", "union"),
        legend_gp = gpar(col = 2:4)
    ))
)

8.7 Example with the movies dataset

The UpSetR package also provides a movies dataset, which contains 17 genres for 3883 movies. First load the dataset.

movies = read.csv(system.file("extdata", "movies.csv", package = "UpSetR"), 
    header = TRUE, sep = ";")
head(movies)
##                                 Name ReleaseDate Action Adventure Children
## 1                   Toy Story (1995)        1995      0         0        1
## 2                     Jumanji (1995)        1995      0         1        1
## 3            Grumpier Old Men (1995)        1995      0         0        0
## 4           Waiting to Exhale (1995)        1995      0         0        0
## 5 Father of the Bride Part II (1995)        1995      0         0        0
## 6                        Heat (1995)        1995      1         0        0
##   Comedy Crime Documentary Drama Fantasy Noir Horror Musical Mystery
## 1      1     0           0     0       0    0      0       0       0
## 2      0     0           0     0       1    0      0       0       0
## 3      1     0           0     0       0    0      0       0       0
## 4      1     0           0     1       0    0      0       0       0
## 5      1     0           0     0       0    0      0       0       0
## 6      0     1           0     0       0    0      0       0       0
##   Romance SciFi Thriller War Western AvgRating Watches
## 1       0     0        0   0       0      4.15    2077
## 2       0     0        0   0       0      3.20     701
## 3       1     0        0   0       0      3.02     478
## 4       0     0        0   0       0      2.73     170
## 5       0     0        0   0       0      3.01     296
## 6       0     0        1   0       0      3.88     940

To make the same UpSet plot as in the UpSetR vignette:

m = make_comb_mat(movies, top_n_sets = 6, remove_complement_set = TRUE)
UpSet(m)

The following code makes it look more similar to the original plot. The code is a little long, but most of it customizes the annotations and the row/column orders.

m = make_comb_mat(movies, top_n_sets = 6, remove_complement_set = TRUE)
ss = set_size(m)
UpSet(m, 
    set_order = order(set_size(m)),
    comb_order = order(comb_degree(m), -comb_size(m)),
    top_annotation = HeatmapAnnotation(
        "Genre Intersections" = anno_barplot(comb_size(m), 
            border = FALSE, 
            gp = gpar(fill = "black"), 
            height = unit(4, "cm")
        ), 
        annotation_name_side = "left", 
        annotation_name_rot = 90),
    left_annotation = rowAnnotation(
        "Movies Per Genre" = anno_barplot(-ss, 
            baseline = 0,
            axis_param = list(
                at = c(0, -500, -1000, -1500),
                labels = c(0, 500, 1000, 1500),
                labels_rot = 0),
            border = FALSE, 
            gp = gpar(fill = "black"), 
            width = unit(4, "cm")
        ),
        set_name = anno_text(set_name(m), 
            location = 0.5, 
            just = "center",
            width = max_text_width(set_name(m)) + unit(4, "mm"))
    ), 
    right_annotation = NULL,
    show_row_names = FALSE)

In the movies dataset, there is also a column AvgRating which gives the rating of each movie. We next split all the movies into five groups based on the ratings.

genre = c("Action", "Romance", "Horror", "Children", "SciFi", "Documentary")
rating = cut(movies$AvgRating, c(0, 1, 2, 3, 4, 5))
m_list = tapply(seq_len(nrow(movies)), rating, function(ind) {
    make_comb_mat(movies[ind, genre, drop = FALSE], remove_complement_set = TRUE)
})
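The grouping relies on base R's cut(), which by default creates intervals that are open on the left and closed on the right. The sketch below applies the same breaks to the six ratings shown in the head(movies) output above. Note that with these breaks a rating of exactly 0 would fall outside every interval and become NA.

```r
breaks = c(0, 1, 2, 3, 4, 5)

# the AvgRating values of the first six movies printed above
r = c(4.15, 3.20, 3.02, 2.73, 3.01, 3.88)
grp = cut(r, breaks)

levels(grp)
as.character(grp)
table(grp)

# the lower bound 0 is not included in the first interval
cut(0, breaks)
```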

The combination matrices in m_list might have different combination sets:

sapply(m_list, comb_size)
## $`(0,1]`
## 010000 001000 000100 000001 
##      1      2      1      1 
## 
## $`(1,2]`
## 100000 010000 001000 110000 000100 101000 000010 100100 000001 100010 
##     14      7     38      1     14      4      3      5      2      5 
## 001010 101010 100110 
##      8      1      1 
## 
## $`(2,3]`
## 100000 010000 001000 110000 000100 101000 000010 100100 010100 000001 
##    126     99    142      8     77      2     27      6      3      9 
## 100010 010010 001010 101010 000110 
##     35      1     27      4      7 
## 
## $`(3,4]`
## 100000 010000 110000 001000 000100 101000 000010 011000 000001 100010 
##    176    276     20     82    122      6     66      3     87     45 
## 010100 001100 010010 001010 110010 101010 000110 100110 
##      4      1      4     11      1      6      5      1 
## 
## $`(4,5]`
## 100000 010000 110000 001000 000100 101000 000010 000001 100010 110010 
##     23     38      4      4      4      1     10     28      6      1 
## 101010 
##      1

To compare multiple groups with UpSet plots, we need to normalize all the matrices so that they have the same sets and the same combination sets. normalize_comb_mat() basically assigns a size of zero to the combination sets that were absent from a given matrix.

m_list = normalize_comb_mat(m_list)
sapply(m_list, comb_size)
##        (0,1] (1,2] (2,3] (3,4] (4,5]
## 010000     1     7    99   276    38
## 001000     2    38   142    82     4
## 000100     1    14    77   122     4
## 000001     1     2     9    87    28
## 100000     0    14   126   176    23
## 110000     0     1     8    20     4
## 101000     0     4     2     6     1
## 000010     0     3    27    66    10
## 100100     0     5     6     0     0
## 100010     0     5    35    45     6
## 001010     0     8    27    11     0
## 101010     0     1     4     6     1
## 100110     0     1     0     1     0
## 010100     0     0     3     4     0
## 010010     0     0     1     4     0
## 000110     0     0     7     5     0
## 011000     0     0     0     3     0
## 001100     0     0     0     1     0
## 110010     0     0     0     1     1
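The idea behind the normalization can be sketched on plain named vectors: collect the union of all combination set names and fill in zeros where a group lacks a combination. This is an illustration of the concept only, not the implementation of normalize_comb_mat(), and the toy list `size_list` is made up for this example.

```r
size_list = list(
    g1 = c("010" = 1, "001" = 2),
    g2 = c("100" = 14, "010" = 7, "110" = 1)
)

# union of all combination set names across groups
all_names = unique(unlist(lapply(size_list, names)))

# align every group to the full set of names, filling gaps with zero
aligned = sapply(size_list, function(s) {
    out = setNames(numeric(length(all_names)), all_names)
    out[names(s)] = s
    out
})
aligned
```

After alignment every group has the same rows in the same order, which is what makes the side-by-side comparison meaningful.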

We calculate the range for the two barplots:

max_set_size = max(sapply(m_list, set_size))
max_comb_size = max(sapply(m_list, comb_size))

And finally we add the five UpSet plots vertically:

ht_list = NULL
for(i in seq_along(m_list)) {
    ht_list = ht_list %v%
        UpSet(m_list[[i]], row_title = paste0("rating in ", names(m_list)[i]),
            set_order = NULL, comb_order = NULL,
            top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size)),
            right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
ht_list

After comparing the five UpSet plots, we can see that most of the movies are rated between 2 and 4. Horror movies tend to have lower ratings and romance movies tend to have higher ratings.

Instead of directly comparing the sizes of the combination sets, we can also compare their fractions relative to the full sets. In the following code, we remove the group (0, 1] because it contains too few movies.

m_list = m_list[-1]
max_set_size = max(sapply(m_list, set_size))
rel_comb_size = sapply(m_list, function(m) {
    s = comb_size(m)
    # because the combination matrix is generated under "distinct" mode
    # the sum of `s` is the size of the full set
    s/sum(s)
})
ht_list = NULL
for(i in seq_along(m_list)) {
    ht_list = ht_list %v%
        UpSet(m_list[[i]], row_title = paste0("rating in ", names(m_list)[i]),
            set_order = NULL, comb_order = NULL,
            top_annotation = HeatmapAnnotation(
                "Relative\nfraction" = anno_barplot(
                    rel_comb_size[, i],
                    ylim = c(0, 0.5),
                    gp = gpar(fill = "black"),
                    border = FALSE,
                    height = unit(2, "cm")
                ), 
                annotation_name_side = "left",
                annotation_name_rot = 0),
            right_annotation = upset_right_annotation(m_list[[i]], 
                ylim = c(0, max_set_size))
        )
}
ht_list

Now the trend is clearer: horror movies are rated low and documentaries are rated high.
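The relative fractions computed above rely on the combination sets being mutually exclusive. A small base R check on two toy sets (made up for this illustration) shows that the sizes add up to the number of distinct elements in distinct mode but not in intersect mode, so dividing by the sum is only a fraction of the full set in the former case.

```r
A = c("x", "y")
B = c("y", "z")
n_total = length(union(A, B))

# distinct mode: each element is counted in exactly one combination set
distinct_size = c("10" = length(setdiff(A, B)),
                  "01" = length(setdiff(B, A)),
                  "11" = length(intersect(A, B)))

# intersect mode: "10" and "01" are the full sets, so "y" is counted three times
intersect_size = c("10" = length(A),
                   "01" = length(B),
                   "11" = length(intersect(A, B)))

sum(distinct_size) == n_total    # TRUE
sum(intersect_size) == n_total   # FALSE

# fractions of the full set, valid in distinct mode
distinct_size / sum(distinct_size)
```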

Next we split the movies by years:

year = floor(movies$ReleaseDate/10)*10
m_list = tapply(seq_len(nrow(movies)), year, function(ind) {
    make_comb_mat(movies[ind, genre, drop = FALSE], remove_complement_set = TRUE)
})
m_list = normalize_comb_mat(m_list)
max_set_size = max(sapply(m_list, set_size))
max_comb_size = max(sapply(m_list, comb_size))
ht_list1 = NULL
for(i in 1:5) {
    ht_list1 = ht_list1 %v%
        UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
            set_order = NULL, comb_order = NULL,
            top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size),
                height = unit(2, "cm")),
            right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}

ht_list2 = NULL
for(i in 6:10) {
    ht_list2 = ht_list2 %v%
        UpSet(m_list[[i]], row_title = paste0(names(m_list)[i], "s"),
            set_order = NULL, comb_order = NULL,
            top_annotation = upset_top_annotation(m_list[[i]], ylim = c(0, max_comb_size),
                height = unit(2, "cm")),
            right_annotation = upset_right_annotation(m_list[[i]], ylim = c(0, max_set_size)))
}
grid.newpage()
pushViewport(viewport(x = 0, width = 0.5, just = "left"))
draw(ht_list1, newpage = FALSE)
popViewport()
pushViewport(viewport(x = 0.5, width = 0.5, just = "left"))
draw(ht_list2, newpage = FALSE)
popViewport()

Now we can see that most of the movies were produced in the 1990s and that the two major genres are action and romance.
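
The decade binning used to split the movies can be sketched on toy data. This is a minimal base R illustration, assuming made-up release years (the vector release is hypothetical): floor(x/10)*10 rounds each year down to the start of its decade, and tapply() then groups the row indices by decade.

```r
# Hypothetical release years (made up for illustration)
release = c(1987, 1991, 1995, 1999, 2000, 1923)

# Round each year down to the start of its decade
decade = floor(release/10)*10
decade

# Group the row indices by decade, in the same way as tapply() is used above
idx_by_decade = tapply(seq_along(release), decade, function(ind) ind)
idx_by_decade
```

The names of the result are the decade labels, which the code above turns into row titles by appending "s".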

Similarly, we can change the top annotation to show the relative fraction of the full set (code not shown).

Finally, we can add the statistics of years, ratings and number of watches for each combination set as boxplot annotations to the right of the UpSet plot.

m = make_comb_mat(movies[, genre], remove_complement_set = TRUE)
m = m[comb_degree(m) > 0]
comb_elements = lapply(comb_name(m), function(nm) extract_comb(m, nm))
years = lapply(comb_elements, function(ind) movies$ReleaseDate[ind])
rating = lapply(comb_elements, function(ind) movies$AvgRating[ind])
watches = lapply(comb_elements, function(ind) movies$Watches[ind])

UpSet(t(m)) + rowAnnotation(years = anno_boxplot(years),
    rating = anno_boxplot(rating),
    watches = anno_boxplot(watches))

We can see that the movies with genre “Scifi + Children” were produced quite long ago, but their ratings are not bad. The movies with genre “Action + Children” have the lowest ratings.
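
The lists passed to anno_boxplot() hold one vector of values per combination set. The idea can be sketched in base R without ComplexHeatmap, assuming a made-up membership table and ratings (genres, rating and pattern are hypothetical names). It only mirrors the default "distinct" mode, where each element belongs to exactly one combination set, and it is not how the package implements extract_comb().

```r
# Hypothetical binary membership table: rows are movies, columns are genres
genres = data.frame(Action   = c(1, 1, 0, 0, 1),
                    Children = c(0, 1, 1, 1, 1))
rating = c(3.1, 2.0, 3.8, 4.0, 2.4)  # made-up ratings

# Encode each row as a pattern string such as "10", "11" or "01"
pattern = apply(genres, 1, paste, collapse = "")

# One vector of ratings per membership pattern
rating_by_pattern = split(rating, pattern)
rating_by_pattern
```

A list of this shape, with one numeric vector per combination set, is what a boxplot annotation needs as input.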

8.8 Example with the genomic regions

The H3K4me3 ChIP-seq peaks from six Roadmap samples are visualized by UpSet plot. The six samples are ESC (E016), ES-deriv1 (E004), ES-deriv2 (E006), Brain (E071), Muscle (E100) and Heart (E104).

First read the files and convert to GRanges objects.

file_list = c(
    "ESC" = "data/E016-H3K4me3.narrowPeak.gz",
    "ES-deriv1" = "data/E004-H3K4me3.narrowPeak.gz",
    "ES-deriv2" = "data/E006-H3K4me3.narrowPeak.gz",
    "Brain" = "data/E071-H3K4me3.narrowPeak.gz",
    "Muscle" = "data/E100-H3K4me3.narrowPeak.gz",
    "Heart" = "data/E104-H3K4me3.narrowPeak.gz"
)
library(GenomicRanges)
peak_list = lapply(file_list, function(f) {
    df = read.table(f)
    GRanges(seqnames = df[, 1], ranges = IRanges(df[, 2], df[, 3]))
})

Make the combination matrix. Note that the sizes of the sets and of the combination sets are now measured in base pairs, i.e. as the sum of the widths of the regions. We only keep the combination sets that cover more than 500kb.

m = make_comb_mat(peak_list)
m = m[comb_size(m) > 500000]
UpSet(m)
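
To make the size definition concrete, here is a minimal base R sketch, assuming made-up coordinates stored in plain data frames instead of GRanges objects (regions and size_bp are hypothetical names). The size of a combination set is taken as the sum of region widths with inclusive coordinates, and the same 500kb cutoff is applied.

```r
# Hypothetical regions per combination set (made-up coordinates)
regions = list(
    "110" = data.frame(start = c(1, 1000001), end = c(400000, 1300000)),
    "011" = data.frame(start = c(5001),       end = c(205000))
)

# Size of a combination set = sum of region widths (inclusive coordinates)
size_bp = sapply(regions, function(df) sum(df$end - df$start + 1))
size_bp

# Keep only the combination sets covering more than 500kb
size_bp[size_bp > 500000]
```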

We can nicely format the axis labels by setting axis_param:

UpSet(m, 
    top_annotation = upset_top_annotation(
        m,
        axis_param = list(at = c(0, 1e7, 2e7),
            labels = c("0MB", "10MB", "20MB")),
        height = unit(4, "cm")
    ),
    right_annotation = upset_right_annotation(
        m,
        axis_param = list(at = c(0, 2e7, 4e7, 6e7),
            labels = c("0MB", "20MB", "40MB", "60MB"),
            labels_rot = 0),
        width = unit(4, "cm")
    ))
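
The break positions and labels do not have to be typed by hand. A small base R sketch, assuming the same breaks as in the right annotation above, derives the labels from the positions:

```r
# Break positions in base pairs
at = seq(0, 6e7, by = 2e7)

# Matching labels in megabases
labels = paste0(at/1e6, "MB")
labels
```

The two vectors can then be passed as axis_param = list(at = at, labels = labels).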

We can associate more information with each set of genomic regions, such as the mean methylation or the distance to the nearest TSS.

subgroup = c("ESC" = "group1",
    "ES-deriv1" = "group1",
    "ES-deriv2" = "group1",
    "Brain" = "group2",
    "Muscle" = "group2",
    "Heart" = "group2"
)
comb_sets = lapply(comb_name(m), function(nm) extract_comb(m, nm))
comb_sets = lapply(comb_sets, function(gr) {
    # we just randomly generate dist_to_tss and mean_meth
    gr$dist_to_tss = abs(rnorm(length(gr), mean = runif(1, min = 500, max = 2000), sd = 1000))
    gr$mean_meth = abs(rnorm(length(gr), mean = 0.1, sd = 0.1))
    gr
})
UpSet(m, 
    top_annotation = upset_top_annotation(
        m,
        axis_param = list(at = c(0, 1e7, 2e7),
            labels = c("0MB", "10MB", "20MB")),
        height = unit(4, "cm")
    ),
    right_annotation = upset_right_annotation(
        m,
        axis_param = list(at = c(0, 2e7, 4e7, 6e7),
            labels = c("0MB", "20MB", "40MB", "60MB"),
            labels_rot = 0),
        width = unit(4, "cm")
    ),
    left_annotation = rowAnnotation(group = subgroup[set_name(m)], show_annotation_name = FALSE),
    bottom_annotation = HeatmapAnnotation(
        dist_to_tss = anno_boxplot(lapply(comb_sets, function(gr) gr$dist_to_tss), outline = FALSE),
        mean_meth = sapply(comb_sets, function(gr) mean(gr$mean_meth)),
        annotation_name_side = "left"
    )
)
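
The left annotation relies on indexing a named vector by the set names, so that the groups follow the order of the sets in the plot. A minimal base R sketch, assuming a shortened mapping and a made-up set order (set_order is a hypothetical stand-in for the result of set_name(m)):

```r
# Mapping from sample name to group, as in `subgroup` above (shortened)
subgroup = c("ESC" = "group1", "Brain" = "group2", "Heart" = "group2")

# Hypothetical set order, standing in for the result of set_name(m)
set_order = c("Heart", "ESC", "Brain")

# Indexing by name returns the groups in the order of the sets
subgroup[set_order]
```

Looking values up by name rather than by position keeps the annotation aligned with the sets even if the combination matrix stores them in a different order than the mapping.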