Hierarchical clustering is a widely used approach for data analysis.
In this post, I will demonstrate the internal structure of a `hclust`

object.

I first generate a random matrix and apply `hclust()`

.

```
set.seed(123456)
m = matrix(rnorm(25), 5)
rownames(m) = letters[1:5]
hc = hclust(dist(m))
```

Typing the `hc`

object simply prints the basic information of the object.

`hc`

```
##
## Call:
## hclust(d = dist(m))
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 5
```

And `plot()`

function draws the clustering results:

`plot(hc)`

Now, what is hiding behind the `hc`

object? Let’s first try `str()`

function to reveal its internals:

`str(hc)`

```
## List of 7
## $ merge : int [1:4, 1:2] -3 -1 1 -5 -4 -2 2 3
## $ height : num [1:4] 2.01 2.48 3.51 4.11
## $ order : int [1:5] 5 3 4 1 2
## $ labels : chr [1:5] "a" "b" "c" "d" ...
## $ method : chr "complete"
## $ call : language hclust(d = dist(m))
## $ dist.method: chr "euclidean"
## - attr(*, "class")= chr "hclust"
```

So, `hc`

is just a simple list with 7 elements. The last three ones `method`

, `call`

and `dist.method`

are less interesting because they are kind like label marks of the analysis. We will
look at the first four elements.

To correctly understand the `hc`

object, we need to be careful with the orders of the vectors/matrices
in `hc`

. We start with `labels`

and `order`

.

`hc$labels`

`## [1] "a" "b" "c" "d" "e"`

`hc$order`

`## [1] 5 3 4 1 2`

The `labels`

element contains text labels of the items *in their original orders*. Here `hc`

is from a matrix `m`

, so the order of `labels`

corresponds to *the original row order in the matrix*.

The order in `order`

is different from that in `labels`

. As hierarchical clustering reorders
the items, `order`

element contains results after the reordering. In the following order vector:

`hc$order`

`## [1] 5 3 4 1 2`

the first value means item 5 is in position 1 after reordering, item 3 is in position 2 after reordering, etc. So if mapping labels to `order`

, it will be:

`hc$labels[hc$order]`

`## [1] "e" "c" "d" "a" "b"`

As you can see in the plot, after the reordering by hierarchical clustering, item “e” locates on the very left, and then item “c”, “d”, “a” and “b”.

Next let’s check the `merge`

and `height`

elements in the `hc`

object:

`hc$merge`

```
## [,1] [,2]
## [1,] -3 -4
## [2,] -1 -2
## [3,] 1 2
## [4,] -5 3
```

`hc$height`

`## [1] 2.014138 2.478801 3.507676 4.113743`

On the first glance, it may look a little bit unnatural that there are both positive and negative
integers in `merge`

. Actually, `merge`

uses different signs to represent different types of objects in the
clustering, i.e. an item or a sub-cluster.

First we need to know, `hclust()`

applies the agglomerative hierarchical clustering procedure, which is “bottom-up” approach, as shown in the following figures.

So `merge`

records the steps of clustering where negative integers correspond to items (or leaves)
and positive integers correspond to intermediate sub-clusters. Then `merge`

can be read as:

`hc$merge`

```
## [,1] [,2]
## [1,] -3 -4
## [2,] -1 -2
## [3,] 1 2
## [4,] -5 3
```

- Step 1 / row 1: item 3 (“c”) and 4 (“d”) are clustered. The sub-cluster has an index 1.
- Step 2 / row 2: item 1 (“a”) and 2 (“b”) are clustered. The sub-cluster has an index 2.
- Step 3 / row 3: Note now the values are both positive. This means sub-cluster 1 and sub-cluster 2 are merged where sub-cluster 1 was constructed in row 1 and sub-cluster 2 was constructed in row 2.
- Step 4 / row 4: This row contains both negative and positive values. This means a single item 5 (“e”) and a sub-cluster is merged where the sub-cluster is from row 3.

`hc$height`

corresponds to `hc$merge`

, and it is the height of each sub-cluster.

With all the elements in `hc`

being explained, let’s try to draw the clustering. Let’s draw the clustering
according to the steps in `hc$merge`

.

The first step is to cluster item 3 and item 4. We know the height of this sub-cluster is `hc$height[1]`

, but what is missing are the positions of items on x-axis, i.e. the horizontal positions of item 3 and 4. Let’s
try to get them.

We know the following are the reordered labels on the plot:

`hc$labels[hc$order]`

`## [1] "e" "c" "d" "a" "b"`

which locate at `x = 1, 2, 3, 4, 5`

. However, it is not in the original order.
As in `hc$merge`

, the “negative” integers correspond to the orignal order (e.g.
-2 means item 2), thus we need to know the positions of items where items are in
their *original orders*.

This can be done by mapping with the labels:

```
# 1:5 is the positions on x-axis
map = structure(1:5, names = hc$labels[hc$order])
# and we use the named vector to reorder into the original order
x = map[hc$labels]
x
```

```
## a b c d e
## 4 5 2 3 1
```

Or we can directly use the following code which takes the order of `hc$order`

.

```
x = order(hc$order)
x
```

`## [1] 4 5 2 3 1`

In `hc$order`

, values are the original indices of items, and the index of `hc$order`

itself
is the positions on x-axis in the clustering plot.

```
hc$order
value 5 3 4 1 2 # indices of items, i.e. item 5, item 3, ...
pos 1 2 3 4 5 # pos in x-axis in the plot
```

If we apply `order()`

on the vector `hc$order`

which is `5 3 4 1 2`

, the first value
in the returned vector is the position of “1” (item 1) in `5 3 4 1 2`

which is 4,
and the second value is the position of “2” (item 2) in `5 3 4 1 2`

which is 5, etc.

This means, applying `order()`

on `hc$order`

returns a vector in the original order of items
and values are the positions on x-axis.

OK, now we can draw the clustering step-by-step.

Step 1 / row 1 in `hc$merge`

: connects item 3 (“c”) and 4 (“d”):

```
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
# the first row in `merge` is '[1,] -3 -4'
segments(x[3], hc$height[1], x[4], hc$height[1])
```

Step 2 / row 2: connects item 1 (“a”) and 2 (“b”):

```
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
# the second row in `merge` is '[2,] -1 -2'
segments(x[1], hc$height[2], x[2], hc$height[2])
```

Step 3 / row 3: connects sub-cluster 1 and sub-cluster 2. Now there is a new thing. Since we are now connecting two sub-clusters, we need to connect to the “middle points” of the two sub-clusters. This can be done very easily. The middle point of sub-cluster 1 is the middle of item 3 between 4, and the middle point of sub-cluster 2 is the middle of item 1 between 2.

```
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
segments(x[1], hc$height[2], x[2], hc$height[2])
midpoint = numeric(0)
midpoint[1] = (x[3]+x[4])/2
midpoint[2] = (x[1]+x[2])/2
# the third row in `merge` is '[3,] 1 2'
segments(midpoint[1], hc$height[3], midpoint[2], hc$height[3])
```

Step 4 / row 4: connects item 5 (“e”) and sub-cluster 3. Similarly, middle point of sub-cluster 3 needs to be calculated in advance which is the middle of the middle points of its two sub clusters (sub-cluster 1 and 2).

```
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
segments(x[1], hc$height[2], x[2], hc$height[2])
midpoint = numeric(0)
midpoint[1] = (x[3]+x[4])/2
midpoint[2] = (x[1]+x[2])/2
segments(midpoint[1], hc$height[3], midpoint[2], hc$height[3])
midpoint[3] = (midpoint[1] + midpoint[2])/2
# the third row in `merge` is '[4,] -5 3'
segments(x[5], hc$height[4], midpoint[3], hc$height[4])
```

Vertical lines can be added as:

```
plot(NULL, xlim = c(0.5, 5.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:5, labels = hc$labels[hc$order])
axis(side = 2)
segments(x[3], hc$height[1], x[4], hc$height[1])
segments(x[3], 0, x[3], hc$height[1])
segments(x[4], 0, x[4], hc$height[1])
segments(x[1], hc$height[2], x[2], hc$height[2])
segments(x[1], 0, x[1], hc$height[2])
segments(x[2], 0, x[2], hc$height[2])
midpoint = numeric(0)
midpoint[1] = (x[3]+x[4])/2
midpoint[2] = (x[1]+x[2])/2
segments(midpoint[1], hc$height[3], midpoint[2], hc$height[3])
segments(midpoint[1], hc$height[3], midpoint[1], hc$height[1])
segments(midpoint[2], hc$height[3], midpoint[2], hc$height[2])
midpoint[3] = (midpoint[1] + midpoint[2])/2
segments(x[5], hc$height[4], midpoint[3], hc$height[4])
segments(midpoint[3], hc$height[4], midpoint[3], hc$height[3])
segments(x[5], 0, x[5], hc$height[4])
```

We can merge all these processes into a function. The function is quite
straightforward to understand. There are four `if-else`

blocks which deal with:

- both children are leaves,
- left child is a leaf, right child is a sub-cluster,
- left child is a sub-cluster, right child is a leaf,
- both children are sub-clusters.

```
plot_hc = function(hc) {
x = order(hc$order)
nobs = length(x)
if(length(hc$labels) == 0) {
hc$labels = as.character(seq_along(hc$order))
}
plot(NULL, xlim = c(0.5, nobs + 0.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:nobs, labels = hc$labels[hc$order])
axis(side = 2)
merge = hc$merge
order = hc$order
nr = nrow(merge)
midpoint = numeric(nr)
for(i in seq_len(nr)) {
child1 = merge[i, 1]
child2 = merge[i, 2]
if(child1 < 0 && child2 < 0) { # both are leaves
segments(x[ -child1 ],
hc$height[i],
x[ -child2 ],
hc$height[i])
midpoint[i] = (x[ -child1 ] + x[ -child2 ])/2
segments(x[ -child1 ], hc$height[i], x[ -child1 ], 0)
segments(x[ -child2 ], hc$height[i], x[ -child2 ], 0)
} else if(child1 < 0 && child2 > 0) {
segments(x[ -child1 ],
hc$height[i],
midpoint[ child2 ],
hc$height[i])
midpoint[i] = (x[ -child1 ] + midpoint[ child2 ])/2
segments(x[ -child1 ], hc$height[i], x[ -child1 ], 0)
segments(midpoint[ child2 ], hc$height[i], midpoint[ child2 ], hc$height[ child2 ])
} else if(merge[i, 1] > 0 && merge[i, 2] < 0) {
segments(midpoint[ child1 ],
hc$height[i],
x[ -child2 ],
hc$height[i])
midpoint[i] = (midpoint[ child1 ] + x[ -child2 ])/2
segments(midpoint[ child1 ], hc$height[i], midpoint[ child1 ], hc$height[ child1 ])
segments(x[ -child2 ], hc$height[i], x[ -child2 ], 0)
} else {
segments(midpoint[ child1 ],
hc$height[i],
midpoint[ child2 ],
hc$height[i])
midpoint[i] = (midpoint[ child1 ] + midpoint[ child2 ])/2
segments(midpoint[ child1 ], hc$height[i], midpoint[ child1 ], hc$height[ child1 ])
segments(midpoint[ child2 ], hc$height[i], midpoint[ child2 ], hc$height[ child2 ])
}
}
}
```

`plot_hc(hc)`

Let’s try a larger clustering object:

```
m2 = matrix(rnorm(1000), nrow = 100)
hc2 = hclust(dist(m2))
plot_hc(hc2)
```

Connecting a parent node and two children nodes in the clustering graph can be generalized as: with
three points: `(left_x, left_y)`

, `(top_x, top_y)`

and `(right_x, right_y)`

which correspond to
xy coordinates of the left child, parent and right child, we can define a function to draw lines to connect
the three points.

The default connection actually connects the following points:

```
x-coordinates: left_x, left_x, right_x, right_x
y-coordinates: left_y, top_y, top_y, right_y
```

In `plot_hc()`

, let’s abstract the code which draws connections between parent and children nodes to
`parent_children_connections()`

which accepts coordinates of the three points.

```
plot_hc = function(hc) {
x = order(hc$order)
nobs = length(x)
if(length(hc$labels) == 0) {
hc$labels = as.character(seq_along(hc$order))
}
plot(NULL, xlim = c(0.5, nobs + 0.5), ylim = c(0, max(hc$height)*1.1), axes = FALSE, ann = FALSE)
axis(side = 1, at = 1:nobs, labels = hc$labels[hc$order])
axis(side = 2)
merge = hc$merge
order = hc$order
height = hc$height
nr = nrow(merge)
midpoint = numeric(nr)
for(i in seq_len(nr)) {
child1 = merge[i, 1]
child2 = merge[i, 2]
if(child1 < 0 && child2 < 0) { # both are leaves
midpoint[i] = (x[ -child1 ] + x[ -child2 ])/2
parent_children_connections(
x[-child1], 0,
midpoint[i], height[i],
x[-child2], 0
)
} else if(child1 < 0 && child2 > 0) {
midpoint[i] = (x[ -child1 ] + midpoint[ child2 ])/2
parent_children_connections(
x[-child1], 0,
midpoint[i], height[i],
midpoint[child2], height[child2]
)
} else if(child1 > 0 && child2 < 0) {
midpoint[i] = (midpoint[ child1 ] + x[ -child2 ])/2
parent_children_connections(
midpoint[child1], height[child1],
midpoint[i], height[i],
x[-child2], 0
)
} else {
midpoint[i] = (midpoint[ child1 ] + midpoint[ child2 ])/2
parent_children_connections(
midpoint[child1], height[child1],
midpoint[i], height[i],
midpoint[child2], height[child2]
)
}
}
}
```

The following function is the default way to connect parent and children nodes:

```
parent_children_connections = function(left_x, left_y, top_x, top_y, right_x, right_y) {
lines(c(left_x, left_x, right_x, right_x),
c(left_y, top_y, top_y, right_y))
}
plot_hc(hc)
```

We can define a new way as:

```
x-coordinates: left_x, top_x, right_x
y-coordinates: left_y, top_y, right_y
```

```
parent_children_connections = function(left_x, left_y, top_x, top_y, right_x, right_y) {
lines(c(left_x, top_x, right_x),
c(left_y, top_y, right_y))
}
plot_hc(hc)
```

Hierarchical clustering constructs a binary tree, where a parent always has two children in the tree. In the next post, I will introduce a more general structure, the dendrogram.