Cluster Analysis

High Dimensional Data Analysis

Anastasios Panagiotelis & Ruben Loaiza-Maya

Lecture 4

1

Why Clustering?

2

Market Segmentation

  • A common strategy in marketing is to analyse different segments of the market.
  • Sometimes the purpose is to segment based on a single variable:
    • Gender
    • Age
    • Income
  • An alternative is to segment using all available information
3

A 2-dimensional example

  • Consider that data is collected for customers’ age and income.
  • These can be plotted on a scatterplot to see if any obvious segments or clusters are present.
  • The following data are not real data but are simulated
4

Age v Income

5

Obvious clusters

6

Only income

7

Only age

8

Summary

  • Using just one variable can be misleading.
  • When there are more than 2 variables just looking at a scatterplot doesn’t work.
  • Instead algorithms can be used to find the clusters in a sensible way, even in high dimensions.
9

Real Example 1

  • The dataset mtcars is an R dataset that originally came from a 1974 issue of the Motor Trend magazine
  • There are 32 cars which are measured on 11 variables such as miles per gallon, number of cylinders, horsepower and weight.
  • It can be loaded into the workspace using the command data(mtcars)
10

MT Cars data

Make/Model mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
11

Dendrogram

12

Real Example 2

  • A business-to-business example with 440 customers of a wholesaler
  • The variables are annual spend in the following 6 categories:
    • Fresh food
    • Milk
    • Groceries
    • Frozen
    • Detergents/Paper
    • Delicatessen
  • These data are available on Moodle.
13

Cluster centroids

After clustering we get the following cluster means.

Cluster Fresh Milk Grocery Frozen Detergents_Paper Delicassen
1 35941 6044 6289 6714 1040 3049
2 8253 3825 5280 2573 1773 1137
3 8000 18511 27574 1997 12407 2252
The clusters may represent hotels, supermarkets and cafes.
14

Approaches to Clustering

  • Hierarchical: Path of solutions:
    • Agglomerative: At start every observation is a cluster. Merge the most similar clusters step by step until all observations in one cluster.
    • Divisive: At start all observations in one cluster. Split step by step until each observation is in its own cluster.
  • Non-hierarchical: Choose the number of clusters ex ante. No merging or splitting.
15

Our focus

  • Our main focus will be on agglomerative hierarchical methods.
  • Divisive hierarchical methods are very slow and we do not cover them at all.
  • We consider one example of a non-hierarchical method known as the k-means algorithm.
16

Definition of Clustering

  • Oxford Dictionary: A group of similar things or people positioned or occurring closely together
  • Collins Dictionary: A number of things growing, fastened, or occurring close together
  • Note the importance of closeness or distance. We need two concepts of distance
    1. Distance between observations.
    2. Distance between clusters.
17

A distance between clusters

  • Let $\mathcal{A}$ be a cluster with observations $\{a_1, a_2, \ldots, a_I\}$ and $\mathcal{B}$ be a cluster with points $\{b_1, b_2, \ldots, b_J\}$.
  • The calligraphic script $\mathcal{A}$ or $\mathcal{B}$ denotes a cluster with possibly more than one point.
  • The bold script $a_i$ or $b_j$ denotes a vector of attributes (e.g. age and income) for each observation.
  • Rather than vectors, it is much easier to think of each observation as a point in a scatterplot.
18

Single Linkage

One way of defining the distance between clusters $\mathcal{A}$ and $\mathcal{B}$ is

$$D(\mathcal{A},\mathcal{B}) = \min_{i,j} D(a_i, b_j)$$

This is called single linkage or nearest neighbour.

19

Single Linkage

20

Single Linkage

21

Complete Linkage

Another way of defining the distance between $\mathcal{A}$ and $\mathcal{B}$ is

$$D(\mathcal{A},\mathcal{B}) = \max_{i,j} D(a_i, b_j)$$

This is called complete linkage or furthest neighbour.

22

Complete Linkage

23

Complete Linkage

24

Complete linkage

  • In the previous example all points in the red cluster are within a distance of 160.01 of all points in the blue cluster.
  • This is why it is called complete linkage.
25
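
The following is a minimal R sketch (not part of the original slides) that illustrates both definitions on two small made-up clusters: the single and complete linkage distances are simply the minimum and maximum of the pairwise distances between the clusters.

# Hypothetical clusters: two points in A, three points in B (2 variables each)
A <- matrix(c(0, 0,
              1, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 4,
              5, 3,
              6, 5), ncol = 2, byrow = TRUE)

# Euclidean distances between every point in A and every point in B
pairwise <- as.matrix(dist(rbind(A, B)))[1:nrow(A), nrow(A) + 1:nrow(B)]

min(pairwise)  # single linkage (nearest neighbour)
max(pairwise)  # complete linkage (furthest neighbour)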

A simple example

  • Over the next couple of slides we will go through the entire process of agglomerative clustering
    • We will use Euclidean distance to define distance between points
    • We will use single linkage to define the distance between clusters
  • There are only five observations and two variables
26

Agglomerative clustering

27

Agglomerative clustering

28

Agglomerative clustering

29

Agglomerative clustering

30

Agglomerative clustering

31

Agglomerative clustering

32

Agglomerative clustering

33

Agglomerative clustering

34

Agglomerative clustering

35

Agglomerative clustering

36

Hierarchical Clustering

  • 5-cluster solution: A and B and C and D and E
  • 4-cluster solution: {A, D} and B and C and E
  • 3-cluster solution: {A, D} and {B, C} and E
  • 2-cluster solution: {A, B, C, D} and E
  • 1-cluster solution: {A, B, C, D, E}
37

Dendrogram

  • The Dendrogram is a useful tool for analysing a cluster solution.
    • Observations are on one axis (usually x)
    • The distance between clusters is on other axis (usually y).
    • From the Dendrogram one can see the order in which the clusters are merged.
38

Dendrogram

39

Interpretation of Dendrogram

  • Think of the axis with distance (usually the y-axis) as measuring a 'tolerance level'.
  • If the distance between two clusters is within the tolerance they are merged into one cluster.
  • As the tolerance increases, more and more clusters are merged, leading to fewer clusters overall.
40

Clustering in R

  • Clustering in R requires at most 3 steps
    1. Standardise the data if they are in different units (using the function scale)
    2. Find the distance between all pairs of observations (using the function dist)
    3. Cluster the data using the function hclust
  • Try this with the mtcars dataset. Use Euclidean distance and complete linkage.
  • Store the result of hclust in a variable called CarsCluster.
41

Clustering in R

library(magrittr)                 # provides the %>% pipe used below
data(mtcars)
mtcars %>%
  scale %>%                       # standardise: the variables are in different units
  dist %>%                        # Euclidean distance between all pairs of cars
  hclust(method = "complete") ->  # complete linkage
  CarsCluster
42
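
For readers who prefer not to use the pipe, the same steps can be written as a single nested call (an equivalent sketch, not from the original slides):

# Standardise, compute pairwise distances, then cluster with complete linkage
CarsCluster <- hclust(dist(scale(mtcars)), method = "complete")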

Dendrogram in R

plot(CarsCluster,cex=0.5)

43

Identifying clusters

CarsCluster%>%plot(cex=0.5)
CarsCluster%>%rect.hclust(k=2)

44

Dendrogram in R

For an interactive tool try:

identify(CarsCluster)

Press the escape key when you are finished.

45

Choosing the number of clusters

46

Choosing clusters

  • Although hierarchical clustering gives a solution for any number of clusters, ultimately we only want to focus on one of these solutions.
  • There is no correct number of clusters. Choosing the number of clusters depends on the context.
  • There are however poor choices for the number of clusters.
47

Choosing clusters

  • Do not choose too many clusters:
    • A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies.
  • Do not choose too few clusters:
    • If you choose the 1-cluster solution there is no point in doing clustering at all.
48

Using dendrogram

  • One criterion is that the number of clusters is stable over a wide range of tolerance.
  • The plot on the next slide shows a 3 cluster solution.
49

Three cluster solution

50

Stability

  • The tolerance for a three cluster solution is about 5.9.
  • If the tolerance is increased by a very small amount then we will have a two cluster solution.
  • If the tolerance is decreased by a very small amount then we will have a four cluster solution.
51

Two cluster solution

52

Four cluster solution

53

Stability

  • In the previous example
    • The three cluster solution is not stable
    • The two and four cluster solutions are stable
  • In general look for a long stretch of tolerance, over which the number of clusters does not change.
54

Extracting the clusters

For a given number of clusters we can create a new variable indicating cluster membership via the cutree function.

mem<-cutree(CarsCluster,2)
x
Mazda RX4 1
Mazda RX4 Wag 1
Datsun 710 2
Hornet 4 Drive 2
Hornet Sportabout 1
Valiant 2
Duster 360 1
Merc 240D 2
Merc 230 2
Merc 280 2
Merc 280C 2
Merc 450SE 1
Merc 450SL 1
Merc 450SLC 1
Cadillac Fleetwood 1
Lincoln Continental 1
Chrysler Imperial 1
Fiat 128 2
Honda Civic 2
Toyota Corolla 2
Toyota Corona 2
Dodge Challenger 1
AMC Javelin 1
Camaro Z28 1
Pontiac Firebird 1
Fiat X1-9 2
Porsche 914-2 2
Lotus Europa 2
Ford Pantera L 1
Ferrari Dino 1
Maserati Bora 1
Volvo 142E 2
55
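
As a small aside (not in the original slides), cutree can also cut the tree at a chosen tolerance (height) rather than at a fixed number of clusters, which matches the dendrogram interpretation discussed earlier. The height value below is arbitrary and purely for illustration.

memh <- cutree(CarsCluster, h = 4)  # cut the dendrogram at height (tolerance) 4
table(memh)                         # number of observations in each resulting cluster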

Pros and Cons of Single Linkage

  • Pros:
    • Single linkage is very easy to understand.
    • Single linkage is a very fast algorithm.
  • Cons:
    • Single linkage is very sensitive to single observations which leads to chaining.
    • Complete linkage avoids this problem and gives more compact clusters with a similar diameter.
56

Chaining

57

Single Linkage Dendrogram

58

Single Linkage

59

Add one observation

60

New solution

61

Dendrogram with Chaining

62

Robustness

  • In general adding a single observation should not dramatically change the analysis.
  • In this instance the new observation was not even an outlier.
  • A term used for such an observation is an inlier.
  • Methods that are not affected by single observations are often called robust.
  • Let's see if complete linkage is robust to the inlier.
63

Complete Linkage

64

Complete Linkage: Dendrogram

65

Disadvantages of CL

  • Complete Linkage overcomes chaining and is robust to inliers
  • However, since the distance between clusters only depends on two observations it can still be sensitive to outliers.
  • The following methods are more robust and should be preferred
    • Average Linkage
    • Centroid Method
    • Ward’s Method
66

Average Linkage

The distance between two clusters can also be defined using all the pairwise distances between the elements of each cluster:

$$D(\mathcal{A},\mathcal{B}) = \frac{1}{|\mathcal{A}|\,|\mathcal{B}|} \sum_{i=1}^{|\mathcal{A}|} \sum_{j=1}^{|\mathcal{B}|} D(a_i, b_j)$$

Here $|\mathcal{A}|$ is the number of observations in cluster $\mathcal{A}$ and $|\mathcal{B}|$ is the number of observations in cluster $\mathcal{B}$.

67
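
Continuing the earlier made-up example (the hypothetical matrices A and B and their pairwise distance matrix), the average linkage distance is simply the mean of all the pairwise distances:

mean(pairwise)  # average linkage: mean of all |A| x |B| pairwise distances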

Average Linkage

  • Average linkage can be called different things
    • Between groups method.
    • Unweighted Pair Group Method with Arithmetic mean (UPGMA)
68

Pairwise distances (one obs.)

69

All pairwise distances

70

Centroid Method

  • The centroid of a cluster can be defined as the mean of all the points in the cluster.
  • If $\mathcal{A}$ is a cluster containing the observations $a_i$ then the centroid of $\mathcal{A}$ is given by $$\bar{a} = \frac{1}{|\mathcal{A}|} \sum_{a_i \in \mathcal{A}} a_i$$
  • The distance between two clusters can then be defined as the distance between their respective centroids.
71
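
A minimal sketch of the centroid method, again using the hypothetical clusters A and B from the earlier example:

centroid_A <- colMeans(A)               # mean of each attribute within cluster A
centroid_B <- colMeans(B)               # mean of each attribute within cluster B
sqrt(sum((centroid_A - centroid_B)^2))  # Euclidean distance between the centroids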

Vector mean

  • Recall that $a_i$ is a vector of attributes, e.g. income and age.
  • In this case $\bar{a}$ is also a vector of attributes.
  • Each element of $\bar{a}$ is the mean of a different attribute, e.g. mean income, mean age.
72

Centroid method

73

Centroid method

74

Average Linkage v Centroid

  • Consider an example with one variable (although everything works with vectors too).
  • Suppose we have the clusters $\mathcal{A} = \{0, 2\}$ and $\mathcal{B} = \{3, 5\}$.
  • Find the distance between $\mathcal{A}$ and $\mathcal{B}$ using:
    • Average Linkage
    • Centroid Method
75

Average Linkage

  • Must find distances between all pairs of observations:
    • $D(a_1, b_1) = 3$
    • $D(a_1, b_2) = 5$
    • $D(a_2, b_1) = 1$
    • $D(a_2, b_2) = 3$
  • Averaging these, the distance is 3.
76

Centroid method

  • First find the centroids:
    • $\bar{a} = 1$
    • $\bar{b} = 4$
  • The distance is 3.
  • Here both methods give the same answer but when vectors are used instead they do not give the same answer in general.
77
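
The one-variable example can be checked directly in R (a quick sketch, not part of the original slides):

a <- c(0, 2); b <- c(3, 5)
mean(abs(outer(a, b, "-")))   # average linkage: mean of all pairwise distances = 3
abs(mean(a) - mean(b))        # centroid method: distance between the means    = 3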

Average Linkage v Centroid

  • In average linkage
    1. Compute the distances between pairs of observations
    2. Average these distances
  • In the centroid method
    1. Average the observations to obtain the centroid of each cluster.
    2. Find the distance between centroids
78

Ward's method

  • All methods so far, merge two clusters when the distance between them is small.
  • Ward’s method merges two clusters to minimise within cluster variance.
  • Two variations are implemented in R:
    • Ward.D2 is the same as the original Ward paper.
    • Ward.D is actually based on a mistake but can still work quite well.
79

Within Cluster Variance

  • The within-cluster variance for a cluster $\mathcal{A}$ is defined as

$$V_w(\mathcal{A}) = \frac{1}{|\mathcal{A}| - 1} S(\mathcal{A})$$

where $$S(\mathcal{A}) = \sum_{a_i \in \mathcal{A}} \left[(a_i - \bar{a})'(a_i - \bar{a})\right]$$

80
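
A short sketch of this calculation for a cluster stored as a matrix with one row per observation (reusing the hypothetical matrix A from the earlier example):

S_A <- sum(sweep(A, 2, colMeans(A))^2)  # sum of squared deviations from the centroid
V_A <- S_A / (nrow(A) - 1)              # within-cluster variance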

Vector notation

  • The term $S(\mathcal{A}) = \sum_{a_i \in \mathcal{A}} (a_i - \bar{a})'(a_i - \bar{a})$ uses vector notation, but the idea is simple.
  • Take the difference of each attribute from its mean (e.g. income, age, etc.)
  • Then square them and add together over attributes and observations.
  • The within cluster variance is a total variance across all attributes.
81

Ward's algorithm

  • At each step we must merge two clusters to form a single cluster.
  • Suppose we pick clusters $\mathcal{A}$ and $\mathcal{B}$ to merge into a new cluster $\mathcal{C}$.
  • Ward's algorithm chooses $\mathcal{A}$ and $\mathcal{B}$ so that $V_w(\mathcal{C})$ is as small as possible.
82

Non-hierarchical Clustering

83

Non-hierarchical Clustering

  • In some analyses the exact number of clusters may be known.
  • If so, non-hierarchical clustering may be used.
  • Perhaps the most widely used non-hierarchical method is k-means clustering.
84

k-means

  • In general k-means seeks to find k clusters.
  • The following condition must be satisfied:
    • Each point must be closer to its own cluster's centroid than to the centroid of any other cluster.
  • Knowing the partition into clusters determines the mean.
  • Knowing the means determines the clusters.
85

Optimality

  • The objective of k-means clustering is to choose the centroids in a way that minimises the within-cluster sum of squares.
  • Let $\mathcal{C} = \{\mathcal{C}_1, \ldots, \mathcal{C}_k\}$ be a partition of all the points into $k$ clusters.
  • The objective of k-means is to find $$\underset{\mathcal{C}}{\operatorname{arg\,min}} \sum_{h=1}^{k} S(\mathcal{C}_h)$$
86

NP-hard

  • It is an example of an NP-hard problem
  • The bad news is that NP-hard problems cannot be easily solved by computers.
  • The good news is that your credit card security also relies on the fact that computers cannot easily solve NP-hard problems.
87

Heuristic

  • Fortunately there are algorithms that provide a reasonably good solution to the k-means problem.
  • In some cases they may provide the exact solution, although there are no guarantees.
  • We will now cover Lloyd's algorithm which provides good intuition into the k-means problem.
  • By default, R implements the more sophisticated (and complicated) Hartigan-Wong algorithm.
88

Lloyd's algorithm

  1. Choose initial centroids (possibly at random).
  2. Allocate each observation to cluster corresponding with nearest centroid
  3. Re-compute centroids as the mean of all observations in the cluster
  4. Repeat steps 2 and 3 until convergence
89
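
The steps above translate almost line for line into R. The function below is a rough sketch for intuition only (it assumes no cluster ever becomes empty and uses a simple random initialisation); it is not the algorithm R uses by default.

lloyd <- function(X, k, iters = 100) {
  X <- as.matrix(X)
  centroids <- X[sample(nrow(X), k), , drop = FALSE]  # 1. random initial centroids
  for (it in 1:iters) {
    # 2. allocate each observation to the cluster with the nearest centroid
    d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # 3. recompute centroids as the mean of each cluster (assumes none is empty)
    new_centroids <- apply(X, 2, function(col) tapply(col, cluster, mean))
    # 4. stop when the centroids no longer change
    if (all(new_centroids == centroids)) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

# Example call on standardised data, e.g. lloyd(scale(mtcars), 2)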

Raw Data

90

Initial Centroids

91

Initial Allocation

92

Re-compute Centroids

93

Reallocate

94

Reallocate

95

Recompute Centroids

96

Reallocate

97

Reallocate

98

Stable solution

99

Wholesaler Data

  • Recall the Wholesaler data from earlier in the lecture
  • The variables are annual spend in 6 categories.
  • Should the data be standardised?
  • Try to carry out k-means clustering using the R function kmeans
  • Find a solution with 3 clusters.
100

k-means in R

To obtain a three-cluster solution:

WholesaleCluster<-kmeans(Wholesale,3)

If the data are in a data.frame you may need to select the numeric variables.

101

R output

  • The result of the R function kmeans will be a list containing several entries. The most interesting are
    • A variable indicating cluster membership is given in cluster
    • The centroids for each cluster are given in centers
    • The number of observations in each cluster is given by size
    • The cluster centroids can be useful for profiling the clusters.
102
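
A brief sketch of how these components are accessed, assuming the WholesaleCluster object created earlier:

WholesaleCluster$cluster  # cluster membership for each of the 440 customers
WholesaleCluster$centers  # centroid (mean annual spend) of each cluster
WholesaleCluster$size     # number of customers in each cluster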

Cluster Centroids

Fresh Milk Grocery Frozen Detergents_Paper Delicassen
8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
103

Robustness Check

Since the k-means solution is sensitive to the starting values, we can run the algorithm from many different starting points using the nstart option

WholesaleCluster<-kmeans(Wholesale,3,nstart = 25)
Fresh Milk Grocery Frozen Detergents_Paper Delicassen
35941.40 6044.450 6288.617 6713.967 1039.667 3049.467
8000.04 18511.420 27573.900 1996.680 12407.360 2252.020
8253.47 3824.603 5280.455 2572.661 1773.058 1137.497
104
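
Because nstart keeps the best of the 25 runs, the minimised objective (reported by kmeans as tot.withinss) from the nstart run will typically be no larger than that of a single run. A quick check, assuming the Wholesale data from earlier:

kmeans(Wholesale, 3)$tot.withinss               # objective from a single random start
kmeans(Wholesale, 3, nstart = 25)$tot.withinss  # best objective over 25 random starts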

Label switching

  • Two slides back, the second cluster had the highest spend on fresh food.
  • One slide back, the first cluster had the highest spend on fresh food.
  • The centroids were identical, they were just reordered. This is called label switching.
  • It does not matter which cluster is first, second or third. The means are important.
105

Number of clusters

  • The motivation of k-means clustering is that the number of clusters is already known.
  • In principle, different choices of k can be used and compared to one another.
  • However, unlike hierarchical clustering, these different solutions can contradict one another.
106

The meaning of non-hierarchical

  • Consider the two-cluster solution (Solution A) and the three-cluster solution (Solution B) from hierarchical clustering.
    • If two observations are in the same cluster in Solution B then they will be in the same cluster in Solution A.
  • The same is not true for non-hierarchical clustering, including k-means clustering.
107

Hierarchical Clustering

Together we will use Ward's method to do hierarchical clustering on the Wholesale data and get the cluster membership from the two and three cluster solutions.

Then you can try the same for k-means

108

Solution

Wholesale%>%
dist%>%
hclust(method='ward.D2')->hiercl
cl2<-cutree(hiercl,2)
cl3<-cutree(hiercl,3)
table(cl2,cl3)
   cl3
cl2   1   2   3
  1 261   0  45
  2   0 134   0
109

Same exercise for k-means

km2<-kmeans(Wholesale,2)
kmcl2<-km2$cluster
km3<-kmeans(Wholesale,3)
kmcl3<-km3$cluster
table(kmcl2,kmcl3)
     kmcl3
kmcl2   1  2  3
    1   0 59  6
    2 330  1 44
110

Non-hierarchical

  • Consider the observations in Cluster 3 when k=3. When we go from k=3 to k=2
    • There are 6 of these observations that go to the new cluster 1.
    • The remaining 44 observations go to the new cluster 2.
  • Notice that there is some label switching as well.
111

Comparing Cluster solutions

112

Comparing Cluster solutions

  • A challenging aspect of cluster analysis is that it is difficult to evaluate a cluster solution.
    • In forecasting, we compare forecasts to outcomes.
    • In regression, we look at goodness of fit.
  • There is also very little theory to guide us.
    • In regression we know least squares is BLUE under certain assumptions.
  • How do we choose a clustering algorithm?
113

Choosing a method

  • There is no ideal method to do hierarchical clustering.
  • A good strategy is to try a few different methods.
  • If there is a clear structure in the data then most methods will give similar results.
    • It is not unusual to find one method yielding very different results.
  • If all methods give vastly different results then perhaps there are no clear clusters in the data.
114

Robustness

  • We can check if a clustering solution is robust to different algorithms.
  • For example, if the centroid method, average linkage, Ward's method and k-means all give similar clusters, then we can be confident that the clusters are truly a feature of the data.
  • One way to evaluate this is to look at the Rand Index.
115

Rand Index

  • Suppose we have two cluster solutions, Solution A and Solution B.
  • Pick two observations, x and y, at random. There are four possible scenarios:
    1. x and y are in the same cluster in Solution A and the same cluster in Solution B
    2. x and y are in different clusters in Solution A and different clusters in Solution B
    3. x and y are in the same cluster in Solution A and different clusters in Solution B
    4. x and y are in different clusters in Solution A and the same cluster in Solution B
116

Rand Index

  • Scenarios 1 and 2 both suggest that the cluster solutions are in agreement.
  • Scenarios 3 and 4 both suggest that the cluster solutions are in disagreement.
  • The Rand Index is the probability that a randomly chosen pair of observations is in agreement.
  • The Rand Index lies between 0 and 1 and higher numbers indicate agreement.
117
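
In symbols (using notation not in the original slides): if $n$ is the number of observations and $n_{\text{agree}}$ is the number of pairs falling under scenario 1 or scenario 2, then

$$\text{Rand Index} = \frac{n_{\text{agree}}}{\binom{n}{2}}$$

since $\binom{n}{2}$ is the total number of distinct pairs.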

Adjusted Rand Index

  • Even if observations are clustered at random, there will still be some agreement due to chance.
  • The adjusted Rand index is designed to be 0 if the level of agreement is equivalent to the case where clustering is done at random.
  • It is still only equal to 1 if the two clustering solutions are in perfect agreement.
  • The adjusted Rand Index can be computed using the adjustedRandIndex function in the package mclust
118
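
A short sketch comparing the hierarchical (Ward) and k-means 3-cluster solutions obtained earlier, assuming the cl3 and kmcl3 membership vectors are still in the workspace:

library(mclust)                # provides adjustedRandIndex
adjustedRandIndex(cl3, kmcl3)  # values close to 1 indicate the two solutions largely agree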

Conclusion

  • There are many methods for clustering.
  • For this reason a cluster analysis should be carried out carefully and transparently.
  • Although we have focused on algorithms in the lecture, remember that the objective of cluster analysis is to explore the data.
  • As such remember to profile the clusters and to provide insight into what these clusters may represent.
119
