Cluster Analysis

High Dimensional Data Analysis

Anastasios Panagiotelis & Ruben Loaiza-Maya

Lecture 4

1

Why Clustering?

2

Market Segmentation

  • A common strategy in marketing is to analyse different segments of the market.
  • Sometimes the purpose is to segment based on a single variable:
    • Gender
    • Age
    • Income
  • An alternative is to segment using all available information
3

A 2-dimensional example

  • Consider that data is collected for customers’ age and income.
  • These can be plotted on a scatterplot to see if any obvious segments or clusters are present.
  • The following data are not real data but are simulated
4

Age v Income

5

Obvious clusters

6

Only income

7

Only age

8

Summary

  • Using just one variable can be misleading.
  • When there are more than 2 variables just looking at a scatterplot doesn’t work.
  • Instead algorithms can be used to find the clusters in a sensible way, even in high dimensions.
9

Real Example 1

  • The dataset mtcars is an R dataset that originally came from a 1974 issue of the Motor Trend magazine
  • There are 32 cars which are measured on 11 variables such as miles per gallon, number of cylinders, horsepower and weight.
  • It can be loaded into the workspace using the command data(mtcars)
10

MT Cars data

Make/Model mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
11

Dendrogram

12

Real Example 2

  • A business-to-business example with 440 customers of a wholesaler
  • The variables are annual spend in the following 6 categories:
    • Fresh food
    • Milk
    • Groceries
    • Frozen
    • Detergents/Paper
    • Delicatessen
  • These data are available on Moodle.
13

Cluster centroids

After clustering we get the following cluster means.

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35941 6044 6289 6714 1040 3049
2 8253 3825 5280 2573 1773 1137
3 8000 18511 27574 1997 12407 2252
The clusters may represent hotels, supermarkets and cafes.
14

Approaches to Clustering

  • Hierarchical: Path of solutions:
    • Agglomerative: At start every observation is a cluster. Merge the most similar clusters step by step until all observations in one cluster.
    • Divisive: At start all observations in one cluster. Split step by step until each observation is in its own cluster.
  • Non-hierarchical: Choose the number of clusters ex ante. No merging or splitting.
15

Our focus

  • Our main focus will be on agglomerative hierarchical methods.
  • Divisive hierarchical methods are very slow and we do not cover them at all.
  • We consider one example of a non-hierarchical method known as the k-means algorithm.
16

Definition of Clustering

  • Oxford Dictionary: A group of similar things or people positioned or occurring closely together
  • Collins Dictionary: A number of things growing, fastened, or occurring close together
  • Note the importance of closeness or distance. We need two concepts of distance
    1. Distance between observations.
    2. Distance between clusters.
17

A distance between clusters

  • Let $\mathcal{A}$ be a cluster with observations $\{a_1, a_2, \ldots, a_I\}$ and $\mathcal{B}$ be a cluster with points $\{b_1, b_2, \ldots, b_J\}$.
  • The calligraphic script $\mathcal{A}$ or $\mathcal{B}$ denotes a cluster with possibly more than one point.
  • The bold script $a_i$ or $b_j$ denotes a vector of attributes (e.g. age and income) for each observation.
  • Rather than vectors, it is much easier to think of each observation as a point in a scatterplot.
18

Single Linkage

One way of defining the distance between clusters $\mathcal{A}$ and $\mathcal{B}$ is

$$D(\mathcal{A},\mathcal{B}) = \min_{i,j} D(a_i, b_j)$$

This is called single linkage or nearest neighbour.

19

Single Linkage

20

Single Linkage

21

Complete Linkage

Another way of defining the distance between $\mathcal{A}$ and $\mathcal{B}$ is

$$D(\mathcal{A},\mathcal{B}) = \max_{i,j} D(a_i, b_j)$$

This is called complete linkage or furthest neighbour.

22

Complete Linkage

23

Complete Linkage

24

Complete linkage

  • In the previous example all points in the red cluster are within a distance of 160.01 of all points in the blue cluster.
  • This is why it is called complete linkage.
25
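
The following is a minimal R sketch (not part of the original slides) that illustrates both definitions on two small made-up clusters: the single and complete linkage distances are simply the minimum and maximum of the pairwise distances between the clusters.

# Hypothetical clusters: two points in A, three points in B (2 variables each)
A <- matrix(c(0, 0,
              1, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 4,
              5, 3,
              6, 5), ncol = 2, byrow = TRUE)

# Euclidean distances between every point in A and every point in B
pairwise <- as.matrix(dist(rbind(A, B)))[1:nrow(A), nrow(A) + 1:nrow(B)]

min(pairwise)  # single linkage (nearest neighbour)
max(pairwise)  # complete linkage (furthest neighbour)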

A simple example

  • Over the next couple of slides we will go through the entire process of agglomerative clustering
    • We will use Euclidean distance to define distance between points
    • We will use single linkage to define the distance between clusters
  • There are only five observations and two variables
26

Agglomerative clustering

27

Agglomerative clustering

28

Agglomerative clustering

29

Agglomerative clustering

30

Agglomerative clustering

31

Agglomerative clustering

32

Agglomerative clustering

33

Agglomerative clustering

34

Agglomerative clustering

35

Agglomerative clustering

36

Hierarchical Clustering

  • 5-cluster solution: A and B and C and D and E
  • 4-cluster solution: {A, D} and B and C and E
  • 3-cluster solution: {A, D} and {B, C} and E
  • 2-cluster solution: {A, B, C, D} and E
  • 1-cluster solution: {A, B, C, D, E}
37

Dendrogram

  • The Dendrogram is a useful tool for analysing a cluster solution.
    • Observations are on one axis (usually x)
    • The distance between clusters is on other axis (usually y).
    • From the Dendrogram one can see the order in which the clusters are merged.
38

Dendrogram

39

Interpretation of Dendrogram

  • Think of the axis with distance (usually the y-axis) as measuring a 'tolerance level'.
  • If the distance between two clusters is within the tolerance they are merged into one cluster.
  • As the tolerance increases, more and more clusters are merged, leading to fewer clusters overall.
40

Clustering in R

  • Clustering in R requires at most 3 steps
    1. Standardise the data if they are in different units (using the function scale)
    2. Find the distance between all pairs of observations (using the function dist)
    3. Cluster the data using the function hclust
  • Try this with the mtcars dataset. Use Euclidean distance and complete linkage.
  • Store the result of hclust in a variable called CarsCluster.
41

Clustering in R

library(magrittr)                 # provides the %>% pipe used below
data(mtcars)
mtcars %>%
  scale %>%                       # standardise: the variables are in different units
  dist %>%                        # Euclidean distance between all pairs of cars
  hclust(method = "complete") ->  # complete linkage
  CarsCluster
42
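
For readers who prefer not to use the pipe, the same steps can be written as a single nested call (an equivalent sketch, not from the original slides):

# Standardise, compute pairwise distances, then cluster with complete linkage
CarsCluster <- hclust(dist(scale(mtcars)), method = "complete")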

Dendrogram in R

plot(CarsCluster,cex=0.5)

43

Identifying clusters

CarsCluster%>%plot(cex=0.5)
CarsCluster%>%rect.hclust(k=2)

44

Dendrogram in R

For an interactive tool try:

identify(CarsCluster)

Press the escape key when you are finished.

45

Choosing the number of clusters

46

Choosing clusters

  • Although hierarchical clustering gives a solution for any number of clusters, ultimately we only want to focus on one of these solutions.
  • There is no correct number of clusters. Choosing the number of clusters depends on the context.
  • There are however poor choices for the number of clusters.
47

Choosing clusters

  • Do not choose too many clusters:
    • A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.
  • Do not choose too few clusters:
    • If you choose the 1-cluster solution there is no point in doing clustering at all.
48

Using dendrogram

  • One criterion is that the number of clusters is stable over a wide range of tolerance.
  • The plot on the next slide shows a 3 cluster solution.
49

Three cluster solution

50

Stability

  • The tolerance for a three cluster solution is about 5.9.
  • If the tolerance is increased by a very small amount then we will have a two cluster solution.
  • If the tolerance is decreased by a very small amount then we will have a four cluster solution.
51

Two cluster solution

52

Four cluster solution

53

Stability

  • In the previous example
    • The three cluster solution is not stable
    • The two and four cluster solutions are stable
  • In general look for a long stretch of tolerance, over which the number of clusters does not change.
54

Extracting the clusters

For a given number of clusters we can create a new variable indicating cluster membership via the cutree function.

mem<-cutree(CarsCluster,2)
x
Mazda RX4 1
Mazda RX4 Wag 1
Datsun 710 2
Hornet 4 Drive 2
Hornet Sportabout 1
Valiant 2
Duster 360 1
Merc 240D 2
Merc 230 2
Merc 280 2
Merc 280C 2
Merc 450SE 1
Merc 450SL 1
Merc 450SLC 1
Cadillac Fleetwood 1
Lincoln Continental 1
Chrysler Imperial 1
Fiat 128 2
Honda Civic 2
Toyota Corolla 2
Toyota Corona 2
Dodge Challenger 1
AMC Javelin 1
Camaro Z28 1
Pontiac Firebird 1
Fiat X1-9 2
Porsche 914-2 2
Lotus Europa 2
Ford Pantera L 1
Ferrari Dino 1
Maserati Bora 1
Volvo 142E 2
55
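
As a small aside (not in the original slides), cutree can also cut the tree at a chosen tolerance (height) rather than at a fixed number of clusters, which matches the dendrogram interpretation discussed earlier. The height value below is arbitrary and purely for illustration.

memh <- cutree(CarsCluster, h = 4)  # cut the dendrogram at height (tolerance) 4
table(memh)                         # number of observations in each resulting cluster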

Pros and Cons of Single Linkage

  • Pros:
    • Single linkage is very easy to understand.
    • Single linkage is a very fast algorithm.
  • Cons:
    • Single linkage is very sensitive to single observations which leads to chaining.
    • Complete linkage avoids this problem and gives more compact clusters with a similar diameter.
56

Chaining

57

Single Linkage Dendrogram

58

Single Linkage

59

Add one observation

60

New solution

61

Dendrogram with Chaining

62

Robustness

  • In general adding a single observation should not dramatically change the analysis.
  • In this instance the new observation was not even an outlier.
  • A term used for such an observation is an inlier.
  • Methods that are not affected by single observations are often called robust.
  • Let's see if complete linkage is robust to the inlier.
63

Complete Linkage

64

Complete Linkage: Dendrogram

65

Disadvantages of CL

  • Complete Linkage overcomes chaining and is robust to inliers
  • However, since the distance between clusters only depends on two observations it can still be sensitive to outliers.
  • The following methods are more robust and should be preferred
    • Average Linkage
    • Centroid Method
    • Ward’s Method
66

Average Linkage

The distance between two clusters can also be defined using all the pairwise distances between the elements of each cluster:

$$D(\mathcal{A},\mathcal{B}) = \frac{1}{|\mathcal{A}|\,|\mathcal{B}|} \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{B}|} D(a_i, b_j)$$

Here $|\mathcal{A}|$ is the number of observations in cluster $\mathcal{A}$ and $|\mathcal{B}|$ is the number of observations in cluster $\mathcal{B}$.

67
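
Continuing the earlier made-up example (the hypothetical matrices A and B and their pairwise distance matrix), the average linkage distance is simply the mean of all the pairwise distances:

mean(pairwise)  # average linkage: mean of all |A| x |B| pairwise distances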

Average Linkage

  • Average linkage can be called different things
    • Between groups method.
    • Unweighted Pair Group Method with Arithmetic mean (UPGMA)
68

Pairwise distances (one obs.)

69

All pairwise distances

70

Centroid Method

  • The centroid of a cluster can be defined as the mean of all the points in the cluster.
  • If $\mathcal{A}$ is a cluster containing the observations $a_i$ then the centroid of $\mathcal{A}$ is given by $$\bar{a} = \frac{1}{|\mathcal{A}|} \sum_{a_i \in \mathcal{A}} a_i$$
  • The distance between two clusters can then be defined as the distance between their respective centroids.
71
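
A minimal sketch of the centroid method, again using the hypothetical clusters A and B from the earlier example:

centroid_A <- colMeans(A)               # mean of each attribute within cluster A
centroid_B <- colMeans(B)               # mean of each attribute within cluster B
sqrt(sum((centroid_A - centroid_B)^2))  # Euclidean distance between the centroids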

Vector mean

  • Recall that $a_i$ is a vector of attributes, e.g. income and age.
  • In this case $\bar{a}$ is also a vector of attributes.
  • Each element of $\bar{a}$ is the mean of a different attribute, e.g. mean income, mean age.
72

Centroid method

73

Centroid method

74

Average Linkage v Centroid

  • Consider an example with one variable (although everything works with vectors too).
  • Suppose we have the clusters $\mathcal{A} = \{0, 2\}$ and $\mathcal{B} = \{3, 5\}$.
  • Find the distance between $\mathcal{A}$ and $\mathcal{B}$ using:
    • Average Linkage
    • Centroid Method
75

Average Linkage

  • Must find distances between all pairs of observations:
    • $D(a_1, b_1) = 3$
    • $D(a_1, b_2) = 5$
    • $D(a_2, b_1) = 1$
    • $D(a_2, b_2) = 3$
  • Averaging these, the distance is 3.
76

Centroid method

  • First find the centroids:
    • $\bar{a} = 1$
    • $\bar{b} = 4$
  • The distance is 3.
  • Here both methods give the same answer but when vectors are used instead they do not give the same answer in general.
77
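
The one-variable example can be checked directly in R (a quick sketch, not part of the original slides):

a <- c(0, 2); b <- c(3, 5)
mean(abs(outer(a, b, "-")))   # average linkage: mean of all pairwise distances = 3
abs(mean(a) - mean(b))        # centroid method: distance between the means    = 3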

Average Linkage v Centroid

  • In average linkage
    1. Compute the distances between pairs of observations
    2. Average these distances
  • In the centroid method
    1. Average the observations to obtain the centroid of each cluster.
    2. Find the distance between centroids
78

Ward's method

  • All methods so far, merge two clusters when the distance between them is small.
  • Ward’s method merges two clusters to minimise within cluster variance.
  • Two variations are implemented in R:
    • Ward.D2 is the same as the original Ward paper.
    • Ward.D is actually based on a mistake but can still work quite well.
79

Within Cluster Variance

  • The within-cluster variance for a cluster $\mathcal{A}$ is defined as

$$V_w(\mathcal{A}) = \frac{1}{|\mathcal{A}| - 1} S(\mathcal{A})$$

where $$S(\mathcal{A}) = \sum_{a_i \in \mathcal{A}} \left[(a_i - \bar{a})'(a_i - \bar{a})\right]$$

80
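
A short sketch of this calculation for a cluster stored as a matrix with one row per observation (reusing the hypothetical matrix A from the earlier example):

S_A <- sum(sweep(A, 2, colMeans(A))^2)  # sum of squared deviations from the centroid
V_A <- S_A / (nrow(A) - 1)              # within-cluster variance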

Vector notation

  • The term $S(\mathcal{A}) = \sum_{a_i \in \mathcal{A}} (a_i - \bar{a})'(a_i - \bar{a})$ uses vector notation, but the idea is simple.
  • Take the difference of each attribute from its mean (e.g. income, age, etc.)
  • Then square them and add together over attributes and observations.
  • The within cluster variance is a total variance across all attributes.
81

Ward's algorithm

  • At each step we must merge two clusters to form a single cluster.
  • Suppose we pick clusters $\mathcal{A}$ and $\mathcal{B}$ to merge into a new cluster $\mathcal{C}$.
  • Ward's algorithm chooses $\mathcal{A}$ and $\mathcal{B}$ so that $V_w(\mathcal{C})$ is as small as possible.
82

Non-hierarchical Clustering

83

Non-hierarchical Clustering

  • In some analyses the exact number of clusters may be known.
  • If so, non-hierarchical clustering may be used.
  • Perhaps the most widely used non-hierarchical method is k-means clustering.
84

k-means

  • In general k-means seeks to find k clusters.
  • The following condition must be satisfied:
    • Each point must be closer to its own cluster's centroid than to the centroid of any other cluster.
  • Knowing the partition into clusters determines the mean.
  • Knowing the means determines the clusters.
85

Optimality

  • The objective of k-means clustering is to choose the centroids in a way that minimises the within-cluster sum of squares.
  • Let $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ be a partition of all the points into $k$ clusters.
  • The objective of k-means is to find $$\underset{\mathcal{C}}{\operatorname{arg\,min}} \sum_{h=1}^{k} S(\mathcal{C}_h)$$
86

NP-hard

  • It is an example of an NP-hard problem
  • The bad news is that NP-hard problems cannot be easily solved by computers.
  • The good news is that your credit card security also relies on the fact that computers cannot easily solve NP-hard problems.
87

Heuristic

  • Fortunately there are algorithms that provide a reasonably good solution to the k-means problem.
  • In some cases they may provide the exact solution, although there are no guarantees.
  • We will now cover Lloyd's algorithm which provides good intuition into the k-means problem.
  • By default, R implements the more sophisticated (and complicated) Hartigan-Wong algorithm.
88

Lloyd's algorithm

  1. Choose initial centroids (possibly at random).
  2. Allocate each observation to cluster corresponding with nearest centroid
  3. Re-compute centroids as the mean of all observations in the cluster
  4. Repeat steps 2 and 3 until convergence
89
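
The steps above translate almost line for line into R. The function below is a rough sketch for intuition only (it assumes no cluster ever becomes empty and uses a simple random initialisation); it is not the algorithm R uses by default.

lloyd <- function(X, k, iters = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # 1. random initial centroids
  for (it in 1:iters) {
    # 2. allocate each observation to the cluster with the nearest centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # 3. recompute centroids as the mean of each cluster (assumes none is empty)
    new_centroids <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # 4. stop when the centroids no longer change
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Example call on standardised data, e.g. lloyd(scale(mtcars), 2)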

Raw Data

90

Initial Centroids

91

Initial Allocation

92

Re-compute Centroids

93

Reallocate

94

Reallocate

95

Recompute Centroids

96

Reallocate

97

Reallocate

98

Stable solution

99

Wholesaler Data

  • Recall the Wholesaler data from earlier in the lecture
  • The variables are annual spend in 6 categories.
  • Should the data be standardised?
  • Try to carry out k-means clustering using the R function kmeans
  • Find a solution with 3 clusters.
100

k-means in R

To obtain a three-cluster solution:

WholesaleCluster<-kmeans(Wholesale,3)

If the data are in a data.frame you may need to select the numeric variables.

101

R output

  • The result of the R function kmeans will be a list containing several entries. The most interesting are
    • A variable indicating cluster membership is given in cluster
    • The centroids for each cluster are given in centers
    • The number of observations in each cluster is given by size
    • The cluster centroids can be useful for profiling the clusters.
102
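
A brief sketch of how these components are accessed, assuming the WholesaleCluster object created earlier:

WholesaleCluster$cluster  # cluster membership for each of the 440 customers
WholesaleCluster$centers  # centroid (mean annual spend) of each cluster
WholesaleCluster$size     # number of customers in each cluster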

Cluster Centroids

Fresh Milk Grocery Frozen Detergents_Paper Delicassen
8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
103

Robustness Check

Since the k-means solution is sensitive to the starting values, we can run the algorithm from many different starting points using the nstart option

WholesaleCluster<-kmeans(Wholesale,3,nstart = 25)
Fresh Milk Grocery Frozen Detergents_Paper Delicassen
35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
104
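
Because nstart keeps the best of the 25 runs, the minimised objective (reported by kmeans as tot.withinss) from the nstart run will typically be no larger than that of a single run. A quick check, assuming the Wholesale data from earlier:

kmeans(Wholesale, 3)$tot.withinss               # objective from a single random start
kmeans(Wholesale, 3, nstart = 25)$tot.withinss  # best objective over 25 random starts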

Label switching

  • Two slides back, the second cluster had the highest spend on fresh food.
  • One slide back, the first cluster had the highest spend on fresh food.
  • The centroids were identical, they were just reordered. This is called label switching.
  • It does not matter which cluster is first, second or third. The means are important.
105

Number of clusters

  • The motivation of k-means clustering is that the number of clusters is already known.
  • In principle, different choices of k can be used and compared to one another.
  • However, unlike hierarchical clustering, these different solutions can contradict one another.
106

The meaning of non-hierarchical

  • Consider the two-cluster solution (Solution A) and the three-cluster solution (Solution B) from hierarchical clustering.
    • If two observations are in the same cluster in Solution B then they will be in the same cluster in Solution A.
  • The same is not true for non-hierarchical clustering, including k-means clustering.
107

Hierarchical Clustering

Together we will use Ward's method to do hierarchical clustering on the Wholesale data and get the cluster membership from the two and three cluster solutions.

Then you can try the same for k-means

108

Solution

Wholesale%>%
dist%>%
hclust(method='ward.D2')->hiercl
cl2<-cutree(hiercl,2)
cl3<-cutree(hiercl,3)
table(cl2,cl3)
   cl3
cl2   1   2   3
  1 261   0  45
  2   0 134   0
109

Same exercise for k-means

km2<-kmeans(Wholesale,2)
kmcl2<-km2$cluster
km3<-kmeans(Wholesale,3)
kmcl3<-km3$cluster
table(kmcl2,kmcl3)
     kmcl3
kmcl2   1  2  3
    1   0 59  6
    2 330  1 44
110

Non-hierarchical

  • Consider the observations in Cluster 3 when k=3. When we go from k=3 to k=2
    • There are 6 of these observations that go to the new cluster 1.
    • The remaining 44 observations go to the new cluster 2.
  • Notice that there is some label switching as well.
111

Comparing Cluster solutions

112

Comparing Cluster solutions

  • A challenging aspect of cluster analysis is that it is difficult to evaluate a cluster solution.
    • In forecasting, we compare forecasts to outcomes.
    • In regression, we look at goodness of fit.
  • There is also very little theory to guide us.
    • In regression we know least squares is BLUE under certain assumptions.
  • How do we choose a clustering algorithm?
113

Choosing a method

  • There is no ideal method to do hierarchical clustering.
  • A good strategy is to try a few different methods.
  • If there is a clear structure in the data then most methods will give similar results.
    • It is not unusual to find one method yielding very different results.
  • If all methods give vastly different results then perhaps there are no clear clusters in the data.
114

Robustness

  • We can check if a clustering solution is robust to different algorithms.
  • For example, if the centroid method, average linkage, Ward's method and k-means all give similar clusters, then we can be confident that the clusters are truly a feature of the data.
  • One way to evaluate this is to look at the Rand Index.
115

Rand Index

  • Suppose we have two cluster solutions, Solution A and Solution B.
  • Pick two observations, x and y, at random. There are four possible scenarios:
    1. x and y are in the same cluster in Solution A and the same cluster in Solution B
    2. x and y are in different clusters in Solution A and different clusters in Solution B
    3. x and y are in the same cluster in Solution A and different clusters in Solution B
    4. x and y are in different clusters in Solution A and the same cluster in Solution B
116

Rand Index

  • Scenarios 1 and 2 both suggest that the cluster solutions are in agreement.
  • Scenarios 3 and 4 both suggest that the cluster solutions are in disagreement.
  • The Rand Index is the probability that a randomly chosen pair of observations is in agreement.
  • The Rand Index lies between 0 and 1 and higher numbers indicate agreement.
117
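
In symbols (using notation not in the original slides): if $n$ is the number of observations and $n_{\text{agree}}$ is the number of pairs falling under scenario 1 or scenario 2, then

$$\text{Rand Index} = \frac{n_{\text{agree}}}{\binom{n}{2}}$$

since $\binom{n}{2}$ is the total number of distinct pairs.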

Adjusted Rand Index

  • Even if observations are clustered at random, there will still be some agreement due to chance.
  • The adjusted Rand index is designed to be 0 if the level of agreement is equivalent to the case where clustering is done at random.
  • It is still only equal to 1 if the two clustering solutions are in perfect agreement.
  • The adjusted Rand Index can be computed using the adjustedRandIndex function in the package mclust
118
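
A short sketch comparing the hierarchical (Ward) and k-means 3-cluster solutions obtained earlier, assuming the cl3 and kmcl3 membership vectors are still in the workspace:

library(mclust)                # provides adjustedRandIndex
adjustedRandIndex(cl3, kmcl3)  # values close to 1 indicate the two solutions largely agree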

Conclusion

  • There are many methods for clustering.
  • For this reason a cluster analysis should be carried out carefully and transparently.
  • Although we have focused on algorithms in the lecture, remember that the objective of cluster analysis is to explore the data.
  • As such remember to profile the clusters and to provide insight into what these clusters may represent.
119
