Hierarchical clustering is a widely used approach for data analysis. In this post, I will demonstrate the internal structure of an hclust object. I first generate a random matrix and apply hclust().
set.seed(123456)
m = matrix(rnorm(25), 5)
rownames(m) = letters[1:5]
hc = hclust(dist(m))
Typing hc simply prints the basic information of the object.
hc
##
## Call:
## hclust(d = dist(m))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 5
The plot() function draws the clustering result:
plot(hc)
Now, what is hiding behind the hc object? Let’s first try the str() function to reveal its internals:
str(hc)
## List of 7
## $ merge : int [1:4, 1:2] -3 -1 1 -5 -4 -2 2 3
## $ height : num [1:4] 2.01 2.48 3.51 4.11
## $ order : int [1:5] 5 3 4 1 2
## $ labels : chr [1:5] "a" "b" "c" "d" ...
## $ method : chr "complete"
## $ call : language hclust(d = dist(m))
## $ dist.method: chr "euclidean"
## - attr(*, "class")= chr "hclust"
So, hc is just a simple list of 7 elements. The last three, method, call and dist.method, are less interesting because they merely record metadata of the analysis. We will look at the first four elements.
To correctly understand the hc object, we need to be careful with the orders of the vectors/matrices in hc. We start with labels and order.
hc$labels
## [1] "a" "b" "c" "d" "e"
hc$order
## [1] 5 3 4 1 2
The labels element contains the text labels of the items in their original order. Here hc is computed from the matrix m, so the order of labels corresponds to the original row order of the matrix.
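This can be verified directly. Below is a minimal check; it re-creates m and hc from above so the snippet runs on its own:

```r
set.seed(123456)
m = matrix(rnorm(25), 5)
rownames(m) = letters[1:5]
hc = hclust(dist(m))

# `labels` is simply carried over from the row names of the input matrix
stopifnot(identical(hc$labels, rownames(m)))
```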
The order in order is different from that in labels. As hierarchical clustering reorders the items, the order element contains the item order after the reordering. In the following order vector:
hc$order
## [1] 5 3 4 1 2
the first value means item 5 is in position 1 after reordering, item 3 is in position 2 after reordering, etc. So mapping labels onto order gives:
hc$labels[hc$order]
## [1] "e" "c" "d" "a" "b"
As you can see in the plot, after the reordering by hierarchical clustering, item “e” is located on the very left, followed by items “c”, “d”, “a” and “b”.
Next, let’s check the merge and height elements in the hc object:
hc$merge
## [,1] [,2]
## [1,] -3 -4
## [2,] -1 -2
## [3,] 1 2
## [4,] -5 3
hc$height
## [1] 2.014138 2.478801 3.507676 4.113743
At first glance, it may look a little unnatural that there are both positive and negative integers in merge. Actually, merge uses different signs to represent the two types of objects in the clustering, i.e. single items and sub-clusters.
First we need to know that hclust() applies the agglomerative hierarchical clustering procedure, which is a “bottom-up” approach, as shown in the following figures.
So merge records the steps of the clustering, where negative integers correspond to items (or leaves) and positive integers correspond to intermediate sub-clusters. Then merge can be read as:
hc$merge
## [,1] [,2]
## [1,] -3 -4
## [2,] -1 -2
## [3,] 1 2
## [4,] -5 3
- Step 1 / row 1: items 3 (“c”) and 4 (“d”) are clustered. This sub-cluster gets index 1.
- Step 2 / row 2: items 1 (“a”) and 2 (“b”) are clustered. This sub-cluster gets index 2.
- Step 3 / row 3: note that both values are now positive. This means sub-cluster 1 and sub-cluster 2 are merged, where sub-cluster 1 was constructed in row 1 and sub-cluster 2 in row 2.
- Step 4 / row 4: this row contains both a negative and a positive value. This means the single item 5 (“e”) and a sub-cluster are merged, where the sub-cluster comes from row 3.
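The reading above can also be done programmatically. The following sketch re-creates hc from earlier and prints each row of merge in words; the helper describe() is only for illustration:

```r
set.seed(123456)
m = matrix(rnorm(25), 5)
rownames(m) = letters[1:5]
hc = hclust(dist(m))

# negative entries in `merge` are original items,
# positive entries refer to earlier rows of `merge`
describe = function(i) {
    if(i < 0) {
        paste0("item ", -i, " (\"", hc$labels[-i], "\")")
    } else {
        paste0("sub-cluster ", i)
    }
}
for(i in seq_len(nrow(hc$merge))) {
    cat("step ", i, ": merge ", describe(hc$merge[i, 1]),
        " and ", describe(hc$merge[i, 2]), "\n", sep = "")
}
```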
hc$height corresponds row-by-row to hc$merge; it stores the height of each sub-cluster, i.e. the distance at which the corresponding merge happens.
With all the elements in hc explained, let’s try to draw the clustering ourselves, following the steps in hc$merge.
The first step is to cluster items 3 and 4. We know the height of this sub-cluster is hc$height[1], but what is missing are the positions of the items on the x-axis, i.e. the horizontal positions of items 3 and 4. Let’s try to get them.
We know the following are the reordered labels on the plot:
hc$labels[hc$order]
## [1] "e" "c" "d" "a" "b"
which are located at x = 1, 2, 3, 4, 5. However, this is not the original item order.
As the negative integers in hc$merge refer to the original order (e.g. -2 means item 2), we need to know the x positions of the items listed in their original order.
This can be done by mapping with the labels:
# 1:5 is the positions on x-axis
map = structure(1:5, names = hc$labels[hc$order])
# and we use the named vector to reorder into the original order
x = map[hc$labels]
x
## a b c d e
## 4 5 2 3 1
Or we can directly use the following code, which applies order() to hc$order:
x = order(hc$order)
x
## [1] 4 5 2 3 1
In hc$order, the values are the original indices of the items, while the index into hc$order itself is the position on the x-axis in the clustering plot.
hc$order
value 5 3 4 1 2 # indices of items, i.e. item 5, item 3, ...
pos   1 2 3 4 5 # positions on the x-axis in the plot
If we apply order() to the vector hc$order, which is 5 3 4 1 2, the first value in the returned vector is the position of “1” (item 1) in 5 3 4 1 2, which is 4; the second value is the position of “2” (item 2) in 5 3 4 1 2, which is 5; etc.
This means applying order() to hc$order returns a vector in the original order of the items, whose values are the positions on the x-axis.
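The two approaches give identical results, because order(hc$order) is exactly the inverse permutation of hc$order. A quick check, again re-creating hc so the snippet is self-contained:

```r
set.seed(123456)
m = matrix(rnorm(25), 5)
rownames(m) = letters[1:5]
hc = hclust(dist(m))

map = structure(seq_along(hc$order), names = hc$labels[hc$order])
x1 = unname(map[hc$labels])   # positions via the label mapping
x2 = order(hc$order)          # positions via order()
stopifnot(identical(x1, x2))

# the inverse permutation undoes hc$order
stopifnot(identical(hc$order[x2], seq_along(hc$order)))
```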
OK, now we can draw the clustering step-by-step.
Step 1 / row 1 in hc$merge: connect items 3 (“c”) and 4 (“d”):
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
# the first row in `merge` is '[1,] -3 -4'
segments(x[3], hc$height[1], x[4], hc$height[1])
Step 2 / row 2: connects item 1 (“a”) and 2 (“b”):
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
# the second row in `merge` is '[2,] -1 -2'
segments(x[1], hc$height[2], x[2], hc$height[2])
Step 3 / row 3: connect sub-cluster 1 and sub-cluster 2. Now there is something new: since we are connecting two sub-clusters, we need to connect the “middle points” of the two sub-clusters. This is easy to do: the middle point of sub-cluster 1 is halfway between items 3 and 4, and the middle point of sub-cluster 2 is halfway between items 1 and 2.
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
segments(x[1], hc$height[2], x[2], hc$height[2])
midpoint = numeric(0)
midpoint[1] = (x[3]+x[4])/2
midpoint[2] = (x[1]+x[2])/2
# the third row in `merge` is '[3,] 1 2'
segments(midpoint[1], hc$height[3], midpoint[2], hc$height[3])
Step 4 / row 4: connect item 5 (“e”) and sub-cluster 3. Similarly, the middle point of sub-cluster 3 needs to be calculated in advance; it is halfway between the middle points of its two children (sub-clusters 1 and 2).
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
segments(x[1], hc$height[2], x[2], hc$height[2])
midpoint = numeric(0)
midpoint[1] = (x[3]+x[4])/2
midpoint[2] = (x[1]+x[2])/2
segments(midpoint[1], hc$height[3], midpoint[2], hc$height[3])
midpoint[3] = (midpoint[1] + midpoint[2])/2
# the fourth row in `merge` is '[4,] -5 3'
segments(x[5], hc$height[4], midpoint[3], hc$height[4])
Vertical lines can be added as:
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
segments(x[3], 0, x[3], hc$height[1])
segments(x[4], 0, x[4], hc$height[1])
segments(x[1], hc$height[2], x[2], hc$height[2])
segments(x[1], 0, x[1], hc$height[2])
segments(x[2], 0, x[2], hc$height[2])
midpoint = numeric(0)
midpoint[1] = (x[3]+x[4])/2
midpoint[2] = (x[1]+x[2])/2
segments(midpoint[1], hc$height[3], midpoint[2], hc$height[3])
segments(midpoint[1], hc$height[3], midpoint[1], hc$height[1])
segments(midpoint[2], hc$height[3], midpoint[2], hc$height[2])
midpoint[3] = (midpoint[1] + midpoint[2])/2
segments(x[5], hc$height[4], midpoint[3], hc$height[4])
segments(midpoint[3], hc$height[4], midpoint[3], hc$height[3])
segments(x[5], 0, x[5], hc$height[4])
We can merge all these steps into a single function. The function is quite straightforward to understand. There are four if-else branches, dealing with the cases where:
- both children are leaves,
- left child is a leaf, right child is a sub-cluster,
- left child is a sub-cluster, right child is a leaf,
- both children are sub-clusters.
plot_hc = function(hc) {
x = order(hc$order)
nobs = length(x)
if(length(hc$labels) == 0) {
hc$labels = as.character(seq_along(hc$order))
}
plot(NULL, xlim = c(0.5, nobs + 0.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:nobs, labels = hc$labels[hc$order])
axis(side = 2)
merge = hc$merge
nr = nrow(merge)
midpoint = numeric(nr)
for(i in seq_len(nr)) {
child1 = merge[i, 1]
child2 = merge[i, 2]
if(child1 < 0 && child2 < 0) { # both are leaves
segments(x[ -child1 ],
hc$height[i],
x[ -child2 ],
hc$height[i])
midpoint[i] = (x[ -child1 ] + x[ -child2 ])/2
segments(x[ -child1 ], hc$height[i], x[ -child1 ], 0)
segments(x[ -child2 ], hc$height[i], x[ -child2 ], 0)
} else if(child1 < 0 && child2 > 0) {
segments(x[ -child1 ],
hc$height[i],
midpoint[ child2 ],
hc$height[i])
midpoint[i] = (x[ -child1 ] + midpoint[ child2 ])/2
segments(x[ -child1 ], hc$height[i], x[ -child1 ], 0)
segments(midpoint[ child2 ], hc$height[i], midpoint[ child2 ], hc$height[ child2 ])
} else if(child1 > 0 && child2 < 0) {
segments(midpoint[ child1 ],
hc$height[i],
x[ -child2 ],
hc$height[i])
midpoint[i] = (midpoint[ child1 ] + x[ -child2 ])/2
segments(midpoint[ child1 ], hc$height[i], midpoint[ child1 ], hc$height[ child1 ])
segments(x[ -child2 ], hc$height[i], x[ -child2 ], 0)
} else {
segments(midpoint[ child1 ],
hc$height[i],
midpoint[ child2 ],
hc$height[i])
midpoint[i] = (midpoint[ child1 ] + midpoint[ child2 ])/2
segments(midpoint[ child1 ], hc$height[i], midpoint[ child1 ], hc$height[ child1 ])
segments(midpoint[ child2 ], hc$height[i], midpoint[ child2 ], hc$height[ child2 ])
}
}
}
plot_hc(hc)
Let’s try a larger clustering object:
m2 = matrix(rnorm(1000), nrow = 100)
hc2 = hclust(dist(m2))
plot_hc(hc2)
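Note that m2 has no row names, so hc2$labels is NULL; this is exactly the case that the length(hc$labels) == 0 branch in plot_hc() handles by falling back to numeric labels. A quick check:

```r
set.seed(123456)
m2 = matrix(rnorm(1000), nrow = 100)
hc2 = hclust(dist(m2))

# without row names on the input matrix, no labels are stored
stopifnot(is.null(hc2$labels))
```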
Connecting a parent node and its two children in the clustering graph can be generalized: given three points (left_x, left_y), (top_x, top_y) and (right_x, right_y), which correspond to the xy-coordinates of the left child, the parent and the right child, we can define a function that draws lines connecting the three points.
The default connection actually connects the following points:
x-coordinates: left_x, left_x, right_x, right_x
y-coordinates: left_y, top_y, top_y, right_y
In plot_hc(), let’s abstract the code that draws connections between parent and children nodes into parent_children_connections(), which accepts the coordinates of the three points.
plot_hc = function(hc) {
x = order(hc$order)
nobs = length(x)
if(length(hc$labels) == 0) {
hc$labels = as.character(seq_along(hc$order))
}
plot(NULL, xlim = c(0.5, nobs + 0.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:nobs, labels = hc$labels[hc$order])
axis(side = 2)
merge = hc$merge
height = hc$height
nr = nrow(merge)
midpoint = numeric(nr)
for(i in seq_len(nr)) {
child1 = merge[i, 1]
child2 = merge[i, 2]
if(child1 < 0 && child2 < 0) { # both are leaves
midpoint[i] = (x[ -child1 ] + x[ -child2 ])/2
parent_children_connections(
x[-child1], 0,
midpoint[i], height[i],
x[-child2], 0
)
} else if(child1 < 0 && child2 > 0) {
midpoint[i] = (x[ -child1 ] + midpoint[ child2 ])/2
parent_children_connections(
x[-child1], 0,
midpoint[i], height[i],
midpoint[child2], height[child2]
)
} else if(child1 > 0 && child2 < 0) {
midpoint[i] = (midpoint[ child1 ] + x[ -child2 ])/2
parent_children_connections(
midpoint[child1], height[child1],
midpoint[i], height[i],
x[-child2], 0
)
} else {
midpoint[i] = (midpoint[ child1 ] + midpoint[ child2 ])/2
parent_children_connections(
midpoint[child1], height[child1],
midpoint[i], height[i],
midpoint[child2], height[child2]
)
}
}
}
The following function is the default way to connect parent and children nodes:
parent_children_connections = function(left_x, left_y, top_x, top_y, right_x, right_y) {
lines(c(left_x, left_x, right_x, right_x),
c(left_y, top_y, top_y, right_y))
}
plot_hc(hc)
We can define a new connection style which connects the three points directly:
x-coordinates: left_x, top_x, right_x
y-coordinates: left_y, top_y, right_y
parent_children_connections = function(left_x, left_y, top_x, top_y, right_x, right_y) {
lines(c(left_x, top_x, right_x),
c(left_y, top_y, right_y))
}
plot_hc(hc)
Hierarchical clustering constructs a binary tree, where every parent node has exactly two children. In the next post, I will introduce a more general structure, the dendrogram.