Hierarchical Clustering Using RStudio

Nur Mutmainnah Djafar
Jul 25, 2021 · 6 min read

What is clustering?

A cluster is a subset of data points that are similar to one another. Clustering (also called unsupervised learning) is the process of dividing a dataset into groups such that the members of each group are as similar (close) to one another as possible, and different groups are as dissimilar (far) from one another as possible. Clustering can uncover previously undetected relationships in a dataset. There are many applications for cluster analysis: in business, it can be used to discover and characterize customer segments for marketing; in biology, it can be used to classify plants and animals by their features.

Types of Clustering

Two types of clustering algorithms are hierarchical and nonhierarchical.

Hierarchical clustering is an unsupervised technique that creates clusters in a nested, ordered fashion: similar clusters are grouped together and arranged in a hierarchy, conventionally drawn from top to bottom. It can be divided into two types, agglomerative hierarchical clustering and divisive hierarchical clustering. In agglomerative clustering, pairs of clusters are linked step by step until all the data objects are joined in a single hierarchy; divisive clustering works in the opposite direction, repeatedly splitting one all-inclusive cluster.
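The rest of this post demonstrates the agglomerative variety. As a rough sketch of how the two directions differ in R (diana() from the cluster package is my addition here and is not used in the analysis below):

# Agglomerative: start with each observation in its own cluster, merge upward
data("USArrests")
d <- dist(USArrests)                  # Euclidean distances between states
agglo <- hclust(d, method = "average")

# Divisive: start with one all-inclusive cluster, split downward
library(cluster)
divi <- diana(d)

# Either tree can be cut into k flat groups
cutree(agglo, k = 2)
cutree(as.hclust(divi), k = 2)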

Non-hierarchical clustering forms a fixed, flat set of clusters rather than a tree-like structure. The data are partitioned so as to maximize or minimize some evaluation criterion. K-means clustering is a popular and effective non-hierarchical method: it partitions the data into non-overlapping groups that have no hierarchical relationship to one another.
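For contrast, here is a minimal k-means sketch on the same data analyzed below; k-means is not part of this post's analysis, so treat it only as an illustration:

# k-means: a non-hierarchical (partitioning) method
data("USArrests")
set.seed(123)                         # k-means starts from random centers
km <- kmeans(USArrests, centers = 2, nstart = 25)
km$cluster                            # flat, non-overlapping group labels
km$centers                            # one centroid per cluster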

The Analysis

In this case, I will do the analysis using the hierarchical clustering method. The data set is USArrests, which is built into R: violent crime rates by US state. It contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973, along with the percentage of the population living in urban areas. It is a data frame with 50 observations on 4 variables:
1. Murder : numeric, murder arrests (per 100,000)
2. Assault : numeric, assault arrests (per 100,000)
3. UrbanPop : numeric, percent of urban population
4. Rape : numeric, rape arrests (per 100,000)
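Before cleaning, it is worth a quick structural look at the data (standard base R commands, not shown in the original post):

data("USArrests")
str(USArrests)    # 50 observations of 4 numeric variables
head(USArrests)   # the first few states, with state names as row names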

Step 1: Cleaning the Data and Descriptive Statistics

First, load the data to be used:

data("USArrests")

Next, check for multivariate outliers using the "quan" (quantile-based Mahalanobis distance) method. Before that, load the MVN package.

library(MVN)
# Flag multivariate outliers; showNewData = TRUE also returns the data with outliers removed
outlier <- mvn(USArrests, multivariateOutlierMethod = "quan", showNewData = TRUE)

Based on the plot above, there are 8 outliers in the data, but I will discard only Alaska, because it lies so far from the other observations.

USArrests <- USArrests[-2,]   # Alaska is row 2
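An equivalent, slightly safer variation (my tweak, not in the original) drops the row by name instead of by position:

# Robust to row order: remove Alaska by name
USArrests <- USArrests[rownames(USArrests) != "Alaska", ]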

Then display a summary of the data:

summary(USArrests)

The summary shows the minimum, first quartile (Q1), median, mean, third quartile (Q3), and maximum value of each variable.

Step 2: Assumption Test

Then test the multicollinearity assumption by examining the variance inflation factor (VIF) of each variable against the others. If every VIF < 10, there is no serious multicollinearity in the data.

library(car)
attach(USArrests)
# VIF from regressing each variable on the remaining three
mult1 <- vif(lm(Murder ~ Assault + UrbanPop + Rape))
mult1
mult2 <- vif(lm(Assault ~ Murder + UrbanPop + Rape))
mult2
mult3 <- vif(lm(UrbanPop ~ Murder + Assault + Rape))
mult3
mult4 <- vif(lm(Rape ~ Murder + Assault + UrbanPop))
mult4
# Collect all VIF values in one data frame
cbind.data.frame(mult1, mult2, mult3, mult4)

None of the VIF values above exceeds 10, so it can be concluded that the USArrests data do not contain multicollinearity.
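As a sanity check, the VIF of a variable is 1/(1 − R²), where R² comes from regressing that variable on the others. Computing one by hand (a verification sketch, not in the original) should agree with car::vif:

# VIF for Murder from first principles
r2 <- summary(lm(Murder ~ Assault + UrbanPop + Rape, data = USArrests))$r.squared
1 / (1 - r2)   # well below the usual cutoff of 10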

Step 3: Hierarchical Clustering

Before clustering, find the optimal number of clusters k using the silhouette method.

library(factoextra)
# Average silhouette width for hierarchical clustering (hcut) at each k
fviz_nbclust(USArrests, FUNcluster = hcut, method = "silhouette")

From the plot above, the optimal number of clusters is 2, so the data will be grouped into 2 clusters.

There are various linkage methods for hierarchical clustering. I tried the average linkage, single linkage, Ward, complete linkage, and centroid methods. Of the five, the best method is chosen by calculating the cophenetic correlation for each. The cophenetic correlation coefficient is the correlation between the elements of the original dissimilarity matrix (e.g., a squared Euclidean distance matrix) and the elements generated by the dendrogram (the cophenetic matrix). Cophenetic correlation values range from −1 to 1; the closer the value is to 1, the better the clustering (Pratiwi, Widiharih and Hakim 2019).

# Dissimilarity matrix shared by all linkage methods (Euclidean distance assumed)
clust <- dist(USArrests, method = "euclidean")

# Average linkage method
metode_al <- hclust(clust, "average")
hc_ave <- cophenetic(metode_al)
cor.ave <- cor(clust, hc_ave)
# Single linkage method
metode_sl <- hclust(clust, "single")
hc_single <- cophenetic(metode_sl)
cor.single <- cor(clust, hc_single)
# Complete linkage method
metode_cl <- hclust(clust, "complete")
hc_comp <- cophenetic(metode_cl)
cor.comp <- cor(clust, hc_comp)
# Ward method
metode_w <- hclust(clust, "ward.D")
hc_w <- cophenetic(metode_w)
cor.w <- cor(clust, hc_w)
# Centroid method
metode_cd <- hclust(clust, "centroid")
hc_cd <- cophenetic(metode_cd)
cor.cd <- cor(clust, hc_cd)
# Display the five cophenetic correlations on one row
cbind.data.frame(cor.ave, cor.comp, cor.single, cor.w, cor.cd)

Based on the cophenetic correlation coefficients of the five methods, average linkage has the largest value, 0.7657517, so the clustering will use the average linkage method.

# Dendrogram with the two clusters boxed
plot(metode_al)
rect.hclust(metode_al, k = 2)

To see the members of each group, use:

anggota <- cutree(metode_al, k = 2)   # cluster membership ("anggota") for each state
tabel <- data.frame(USArrests, anggota)
View(tabel)

Based on the dendrogram and table above, cluster 1 consists of 15 states and cluster 2 consists of 34 states.
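The cluster sizes can also be confirmed with a one-line check (my addition):

table(anggota)   # should show 15 states in cluster 1 and 34 in cluster 2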

Step 4: Cluster Profiling

Based on the members of each cluster, profiling is then carried out to determine the characteristics of each cluster. Profiling is done by computing the mean of each variable within each cluster:

# Split the data by cluster membership
cluster1 <- subset(USArrests, anggota == 1)
cluster2 <- subset(USArrests, anggota == 2)
# Mean of each variable within each cluster
cluster_1 <- sapply(cluster1, mean)
cluster_2 <- sapply(cluster2, mean)
mean_total <- rbind(cluster_1, cluster_2)
mean_total

In the profile plot, the red line represents high values and the blue line represents low values.
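That plot is not reproduced here, but a minimal sketch of how such a profile plot could be drawn from mean_total with base graphics (my own rendering, not the original figure) is:

# Profile plot: one line per cluster across the four variables
matplot(t(mean_total), type = "b", pch = 16, lty = 1,
        col = c("red", "blue"), xaxt = "n",
        xlab = "Variable", ylab = "Cluster mean")
axis(1, at = 1:ncol(mean_total), labels = colnames(mean_total))
legend("topright", legend = c("Cluster 1", "Cluster 2"),
       col = c("red", "blue"), lty = 1, pch = 16)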

Based on the profiling, the characteristics of each cluster are obtained, namely:

Cluster 1: US states with the highest violent crime rates: Alabama, Arizona, California, Delaware, Florida, Illinois, Louisiana, Maryland, Michigan, Mississippi, Nevada, New Mexico, New York, North Carolina, and South Carolina.

Cluster 2: US states with the lowest violent crime rates: Arkansas, Colorado, Connecticut, Georgia, Hawaii, Idaho, Indiana, Iowa, Kansas, Kentucky, Maine, Massachusetts, Minnesota, Missouri, Montana, Nebraska, New Hampshire, New Jersey, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, and Wyoming.

FINISH!!

That’s all; I hope it’s useful for you guys…

Sources :

http://www2.cs.uregina.ca/~dbd/cs831/notes/clustering/clustering.html

https://www.geeksforgeeks.org/difference-between-hierarchical-and-non-hierarchical-clustering/
