Topic 17 Hierarchical Clustering: Part 1

17.1 Exercises

You can download a template RMarkdown file to start from here.

These exercises are for practice and are not part of your Homework 7.




To practice the hierarchical clustering algorithm, let’s look at a small example. Suppose we collect the following sepal width and sepal length measurements from 5 irises:

irises <- data.frame(
    Width = c(2.5,2.7,3.2,3.5,3.6),
    Length = c(5.5,6.0,4.5,5.0,4.7)
)
irises
##   Width Length
## 1   2.5    5.5
## 2   2.7    6.0
## 3   3.2    4.5
## 4   3.5    5.0
## 5   3.6    4.7
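library(ggplot2)  # provides ggplot(), aes(), geom_point(), geom_text()

# Scatterplot of the 5 specimens, each labeled by its row number
ggplot(irises, aes(x = Width, y = Length)) +
    geom_point() +
    geom_text(aes(label = 1:5), vjust = 1.5)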




In the exercises below, you’ll draw a dendrogram for these 5 specimens by hand. Sketch the following plotting frame on some scrap paper:




  1. Step 1: First fusion
    • Calculate the distance between each pair of specimens:

      dist(irises)

      The idea:

          1   2   3   4   5
      1   0
      2       0
      3           0
      4               0
      5                   0
    • Which pair of flowers 1-5 is most similar?
    • Draw the fusion between this pair of leaves on your plot. Be careful about the height at which you draw this fusion.
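
To make the height concrete: a fusion is drawn at the distance between the two branches being joined, so for two single leaves it is just their Euclidean distance. The sketch below, which assumes the default Euclidean metric of `dist()`, checks one entry of the distance matrix by hand and then searches for the smallest off-diagonal entry. The names `d_mat`, `d_search`, and `closest` are illustrative only.

```r
# Same five specimens as above
irises <- data.frame(
    Width  = c(2.5, 2.7, 3.2, 3.5, 3.6),
    Length = c(5.5, 6.0, 4.5, 5.0, 4.7)
)

# Euclidean distance between specimens 1 and 2, computed by hand
d12_by_hand <- sqrt((2.5 - 2.7)^2 + (5.5 - 6.0)^2)

# The same entry taken from dist(); as.matrix() expands the
# lower-triangle "dist" object into a full symmetric matrix
d_mat <- as.matrix(dist(irises))
d12_from_dist <- d_mat[1, 2]
all.equal(d12_by_hand, d12_from_dist)

# Find the most similar pair: ignore the zero diagonal, then
# locate the smallest remaining entry
d_search <- d_mat
diag(d_search) <- Inf
closest <- which(d_search == min(d_search), arr.ind = TRUE)
closest        # row/column indices of the closest pair (listed twice by symmetry)
min(d_search)  # the height at which to draw the first fusion
```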




  2. Step 2: Second fusion
    • Construct a new distance matrix reflecting the distances between each pair of branches (where 4 and 5 have now been fused). Use complete linkage. That is, the distance between two branches is the maximum distance between any pair of leaves, taking one leaf from each branch. The idea:

              1   2   3   4 & 5
      1       0
      2           0
      3               0
      4 & 5               0
    • Which pair of branches is most similar? Draw the fusion between this pair on the plot.
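
The complete-linkage rule can be sketched as a small helper that takes the full leaf-to-leaf distance matrix and two branches (given as vectors of leaf indices) and returns the largest distance between them. `complete_link` is a hypothetical name introduced here for illustration; it is not part of base R.

```r
# Same five specimens as above
irises <- data.frame(
    Width  = c(2.5, 2.7, 3.2, 3.5, 3.6),
    Length = c(5.5, 6.0, 4.5, 5.0, 4.7)
)
d_mat <- as.matrix(dist(irises))

# Hypothetical helper: complete-linkage distance between two branches,
# i.e. the maximum distance over all pairs with one leaf from each branch
complete_link <- function(d_mat, branch_a, branch_b) {
    max(d_mat[branch_a, branch_b])
}

# Distance from leaf 3 to the fused branch {4, 5}
complete_link(d_mat, 3, c(4, 5))

# Distance from leaf 1 to the fused branch {4, 5}
complete_link(d_mat, 1, c(4, 5))
```

Calling the helper for each remaining pair of branches fills in the new, smaller distance matrix.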




  3. Step 3: Third fusion
    • Repeat! Construct a new distance matrix reflecting the distances between each pair of branches, those that have been fused already and those that have not.
    • Which pair of branches is most similar? Draw the fusion between this pair on the plot.
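
The "repeat" idea can be written as a loop: keep a list of current branches, fuse the closest pair under complete linkage, and record the height of each fusion. This is a naive sketch for small data, not how `hclust()` is implemented internally; `branches`, `heights`, `pairs`, and `link` are names made up for this example. Try it only after sketching your own answer.

```r
# Same five specimens as above
irises <- data.frame(
    Width  = c(2.5, 2.7, 3.2, 3.5, 3.6),
    Length = c(5.5, 6.0, 4.5, 5.0, 4.7)
)
d_mat <- as.matrix(dist(irises))

# Start with every specimen in its own branch
branches <- as.list(seq_len(nrow(irises)))
heights  <- numeric(0)

while (length(branches) > 1) {
    # Every pair of current branches, one pair per column
    pairs <- combn(length(branches), 2)

    # Complete-linkage distance for each pair of branches
    link <- apply(pairs, 2, function(p) {
        max(d_mat[branches[[p[1]]], branches[[p[2]]]])
    })

    # Fuse the closest pair and record the fusion height
    best    <- pairs[, which.min(link)]
    heights <- c(heights, min(link))
    fused   <- c(branches[[best[1]]], branches[[best[2]]])
    branches <- c(branches[-best], list(fused))
}

heights  # one height per fusion, in the order the fusions happened
```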




  4. Step 4: Final fusion
    At this point, you should have 2 specimens in one cluster and 3 in another. The final step is to combine these into the tree trunk. Draw this fusion.


  5. Check your work in R:

    # Complete linkage is the default method of hclust(); stated explicitly here
    iris_cluster <- hclust(dist(irises), method = "complete")
    plot(iris_cluster)
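
Beyond the plot, the fitted object can be inspected directly to compare against a hand-drawn dendrogram. The sketch below uses the `merge` and `height` components of an `hclust` object and the `cutree()` function from base R's stats package; `groups` is just an illustrative variable name.

```r
# Same five specimens as above
irises <- data.frame(
    Width  = c(2.5, 2.7, 3.2, 3.5, 3.6),
    Length = c(5.5, 6.0, 4.5, 5.0, 4.7)
)
iris_cluster <- hclust(dist(irises), method = "complete")

# One row per fusion, in order. Negative entries are single leaves;
# positive entries refer to the branch formed at that earlier row.
iris_cluster$merge

# Height at which each fusion is drawn, in the same order
iris_cluster$height

# Cut the tree into 2 clusters and count the specimens in each
groups <- cutree(iris_cluster, k = 2)
table(groups)
```

Comparing the rows of `merge` with your four fusions, and `height` with the heights you drew, is a quick way to spot where a hand calculation went wrong.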