Linkage methods for Cluster Observations

Average

With the average linkage method, the distance between two clusters is the average of the distances between each observation in one cluster and each observation in the other cluster. When clusters k and l are merged into cluster m, the distance from m to any other cluster j is calculated as follows:

dmj = (Nk × dkj + Nl × dlj) / Nm

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
Nk    number of observations in cluster k
Nl    number of observations in cluster l
Nm    number of observations in cluster m
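
The following Python sketch illustrates the average linkage update on a small made-up example. It assumes the size-weighted form (Nk × dkj + Nl × dlj) / Nm with Nm = Nk + Nl; the function names and the one-dimensional data are hypothetical and chosen only for illustration. The check at the end compares the updated value with the average taken directly over all pairs of observations.

```python
from itertools import product

def average_update(d_kj, d_lj, n_k, n_l):
    """Distance from merged cluster m = (k, l) to cluster j (assumed size-weighted form)."""
    n_m = n_k + n_l
    return (n_k * d_kj + n_l * d_lj) / n_m

def mean_pairwise(a, b):
    """Average absolute distance over every pair (x in a, y in b) of 1-D observations."""
    return sum(abs(x - y) for x, y in product(a, b)) / (len(a) * len(b))

# Hypothetical one-dimensional clusters.
k = [0.0, 1.0]
l = [4.0]
j = [10.0, 12.0]

d_kj = mean_pairwise(k, j)
d_lj = mean_pairwise(l, j)

# Update from the two stored distances versus recomputing from raw observations.
d_mj = average_update(d_kj, d_lj, len(k), len(l))
direct = mean_pairwise(k + l, j)
```

Under these assumptions the two values agree, which is why only the distance matrix and the cluster sizes need to be stored between merges.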

Centroid

With the centroid linkage method, the distance between two clusters is the distance between the cluster centroids or means. When clusters k and l are merged into cluster m, the distance from m to any other cluster j is calculated as follows:

dmj = (Nk × dkj) / Nm + (Nl × dlj) / Nm - (Nk × Nl × dkl) / Nm^2

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
dkl   distance between clusters k and l
Nk    number of observations in cluster k
Nl    number of observations in cluster l
Nm    number of observations in cluster m
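
As a rough check of the centroid update, the Python sketch below assumes the standard recurrence (Nk × dkj) / Nm + (Nl × dlj) / Nm - (Nk × Nl × dkl) / Nm^2 applied to squared Euclidean distances between centroids. The helper names and the sample points are hypothetical. The result is compared with the squared distance measured directly from the centroid of the merged cluster.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid_update(d_kj, d_lj, d_kl, n_k, n_l):
    """Assumed centroid-linkage recurrence on squared Euclidean distances."""
    n_m = n_k + n_l
    return (n_k * d_kj + n_l * d_lj) / n_m - (n_k * n_l * d_kl) / n_m ** 2

# Hypothetical two-dimensional clusters.
k = [(0.0, 0.0), (2.0, 0.0)]
l = [(4.0, 3.0)]
j = [(10.0, 0.0)]

c_k, c_l, c_j = centroid(k), centroid(l), centroid(j)

d_mj = centroid_update(sq_dist(c_k, c_j), sq_dist(c_l, c_j), sq_dist(c_k, c_l), len(k), len(l))
direct = sq_dist(centroid(k + l), c_j)
```

The agreement holds for squared Euclidean distances; with other distance measures the recurrence is only an approximation of the centroid-to-centroid distance.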

Complete

With the complete linkage method (also called the furthest neighbor method), the distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. When clusters k and l are merged into cluster m, the distance from m to any other cluster j is calculated as follows:

dmj = max (dkj, dlj)

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
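
The update dmj = max (dkj, dlj) can be applied directly to a stored distance matrix. The Python sketch below shows one merge step; the function name and the 4 × 4 matrix are hypothetical values chosen for illustration.

```python
def merge_complete(D, k, l):
    """Distances from merged cluster m = (k, l) to every remaining cluster j."""
    return {j: max(D[k][j], D[l][j]) for j in range(len(D)) if j not in (k, l)}

# Hypothetical symmetric distance matrix for four single-observation clusters.
D = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]

# Clusters 0 and 1 are the closest pair (distance 2), so they merge first.
new_row = merge_complete(D, 0, 1)
```

Here `new_row` holds the larger of the two old distances for each remaining cluster, so the merged cluster is never treated as closer to j than its farthest member.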

McQuitty

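With McQuitty's linkage method, when clusters k and l are merged into cluster m, the distance from m to any other cluster j is the unweighted average of the distances from k and from l to j, regardless of the number of observations in each cluster:

dmj = (dkj + dlj) / 2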

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
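
To see how McQuitty's method differs from average linkage, the Python sketch below assumes the update (dkj + dlj) / 2 and contrasts it with the size-weighted average (Nk × dkj + Nl × dlj) / Nm. The function names and the numbers are hypothetical.

```python
def mcquitty_update(d_kj, d_lj):
    """Assumed McQuitty update: plain mean that ignores cluster sizes."""
    return (d_kj + d_lj) / 2

def average_update(d_kj, d_lj, n_k, n_l):
    """Size-weighted mean, shown only for comparison."""
    return (n_k * d_kj + n_l * d_lj) / (n_k + n_l)

# Hypothetical values: cluster k is large and far from j, cluster l is a nearby singleton.
d_kj, d_lj = 10.0, 2.0
n_k, n_l = 3, 1

unweighted = mcquitty_update(d_kj, d_lj)
weighted = average_update(d_kj, d_lj, n_k, n_l)
```

With these values the unweighted result is 6.0 and the weighted result is 8.0: the singleton cluster l pulls the McQuitty distance down as strongly as the three-observation cluster k pulls it up.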

Median

With the median linkage method, the distance between two clusters is the median distance between an observation in one cluster and an observation in the other cluster. When clusters k and l are merged into cluster m, the distance from m to any other cluster j is calculated as follows:

dmj = (dkj + dlj) / 2 - dkl / 4

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
dkl   distance between clusters k and l
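
The Python sketch below assumes the standard median-linkage recurrence (dkj + dlj) / 2 - dkl / 4 applied to squared Euclidean distances. Under that assumption the update is equivalent to placing the merged cluster at the midpoint of the two merged cluster centers, whatever their sizes. The names and coordinates are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def median_update(d_kj, d_lj, d_kl):
    """Assumed median-linkage recurrence on squared Euclidean distances."""
    return (d_kj + d_lj) / 2 - d_kl / 4

# Hypothetical cluster centers.
c_k, c_l, c_j = (0.0, 0.0), (4.0, 2.0), (6.0, 0.0)

d_mj = median_update(sq_dist(c_k, c_j), sq_dist(c_l, c_j), sq_dist(c_k, c_l))

# Direct computation from the midpoint of the two merged centers.
midpoint = tuple((a + b) / 2 for a, b in zip(c_k, c_l))
direct = sq_dist(midpoint, c_j)
```

Because the cluster sizes do not enter the update, a small cluster shifts the merged center as much as a large one does, which is the main difference from the centroid method.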

Single

With the single linkage method (also called the nearest neighbor method), the distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster. When observations lie close together, single linkage tends to identify long chain-like clusters, with relatively large distances separating observations at either end of the chain.

When clusters k and l are merged into cluster m, the distance from m to any other cluster j is calculated as follows:

dmj = min (dkj, dlj)

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
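
The chaining behavior of single linkage can be illustrated with a naive Python sketch. The function below is a hypothetical, unoptimized implementation for one-dimensional data that records the distance at which each merge happens; it is meant only to show the effect, not to reproduce any particular software's algorithm.

```python
def single_linkage_merges(points):
    """Merge the closest pair of clusters repeatedly and return the merge distances."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: minimum distance over all cross-cluster pairs.
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        heights.append(d)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return heights

# Hypothetical evenly spaced observations forming a chain.
chain = [0.0, 1.0, 2.0, 3.0, 4.0]
heights = single_linkage_merges(chain)
span = max(chain) - min(chain)
```

Every merge in this example happens at distance 1.0, yet the final cluster contains observations that are 4.0 apart, which is the chain-like pattern described above.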

Ward

With Ward's linkage method, the distance between two clusters is the sum of squared deviations from points to centroids. The objective of Ward's linkage is to minimize the within-cluster sum of squares. When clusters k and l are merged into cluster m, the distance from m to any other cluster j is calculated as follows:

dmj = [(Nj + Nk) × dkj + (Nj + Nl) × dlj - Nj × dkl] / (Nj + Nm)

Note

With Ward's linkage method, the distance between two clusters can be larger than dmax, which is the maximum value in the original distance matrix, D. If this happens, the similarity is negative.

Notation

Term  Description
dmj   distance between clusters m and j
m     merged cluster that consists of clusters k and l, with m = (k,l)
dkj   distance between clusters k and j
dlj   distance between clusters l and j
dkl   distance between clusters k and l
Nj    number of observations in cluster j
Nk    number of observations in cluster k
Nl    number of observations in cluster l
Nm    number of observations in cluster m
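
The Python sketch below illustrates how a Ward distance can exceed dmax. It assumes the standard recurrence [(Nj + Nk) × dkj + (Nj + Nl) × dlj - Nj × dkl] / (Nj + Nm) applied to squared Euclidean distances, and it assumes that similarity is defined as 100 × (1 - d / dmax); both are assumptions of this example, and the names and data are hypothetical.

```python
def ward_update(d_kj, d_lj, d_kl, n_j, n_k, n_l):
    """Assumed Ward recurrence on squared Euclidean distances."""
    n_m = n_k + n_l
    return ((n_j + n_k) * d_kj + (n_j + n_l) * d_lj - n_j * d_kl) / (n_j + n_m)

# Three hypothetical single-observation clusters on a line.
x_k, x_l, x_j = 0.0, 2.0, 10.0

d_kj = (x_k - x_j) ** 2
d_lj = (x_l - x_j) ** 2
d_kl = (x_k - x_l) ** 2

# Largest value in the original distance matrix.
d_max = max(d_kj, d_lj, d_kl)

# Merge k and l (the closest pair), then measure the merged cluster against j.
d_mj = ward_update(d_kj, d_lj, d_kl, 1, 1, 1)

# Assumed similarity scale: 100 at distance 0, 0 at d_max, negative beyond it.
similarity = 100 * (1 - d_mj / d_max)
```

In this example the original distances are 100, 64, and 4, while the updated distance is 108, so the assumed similarity value falls below zero.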