class: center, middle, inverse, title-slide # Cluster Analysis ## High Dimensional Data Analysis ### Anastasios Panagiotelis & Ruben Loaiza-Maya ### Lecture 4 --- class: inverse, center, middle # Why Clustering? --- # Market Segmentation - A common strategy in marketing is to analyse different segments of the market.<!--D--> -- - Sometimes the purpose is to segment based on a single variable:<!--D--> -- + Gender<!--D--> -- + Age<!--D--> -- + Income<!--D--> -- - An alternative is to segment using all available information --- # A 2-dimensional example - Consider that data is collected for customers’ *age* and *income*.<!--D--> -- - These can be plotted on a scatterplot to see if any obvious segments or clusters are present.<!--D--> -- - The following data are not real data but are simulated --- # Age v Income <img src="Clustering_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" /> --- # Obvious clusters <img src="Clustering_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" /> --- # Only income <img src="Clustering_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" /> --- # Only age <img src="Clustering_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- # Summary - Using just one variable can be misleading.<!--D--> -- - When there are more than 2 variables just looking at a scatterplot doesn’t work.<!--D--> -- - Instead algorithms can be used to find the clusters in a sensible way, even in high dimensions. --- # Real Example 1 - The dataset mtcars is an R dataset that originally came from a 1974 magazine called Motor Trends<!--D--> -- - There are 32 cars which are measured on 11 variables such as miles per gallon, number of cylinders, horsepower and weight.<!--D--> -- - It can be loaded into the workspace using the command `data(mtcars)` --- # MT Cars data <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> MakeModel </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> mpg </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> cyl </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> disp </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> hp </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> drat </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> wt </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> qsec </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> vs </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> am </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> gear </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> carb </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mazda RX4 </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 160.0 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.90 </td> <td style="text-align:right;"> 2.620 </td> <td 
style="text-align:right;"> 16.46 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Mazda RX4 Wag </td> <td style="text-align:right;"> 21.0 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 160.0 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.90 </td> <td style="text-align:right;"> 2.875 </td> <td style="text-align:right;"> 17.02 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Datsun 710 </td> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 108.0 </td> <td style="text-align:right;"> 93 </td> <td style="text-align:right;"> 3.85 </td> <td style="text-align:right;"> 2.320 </td> <td style="text-align:right;"> 18.61 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Hornet 4 Drive </td> <td style="text-align:right;"> 21.4 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 258.0 </td> <td style="text-align:right;"> 110 </td> <td style="text-align:right;"> 3.08 </td> <td style="text-align:right;"> 3.215 </td> <td style="text-align:right;"> 19.44 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Hornet Sportabout </td> <td style="text-align:right;"> 18.7 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 360.0 </td> <td style="text-align:right;"> 175 </td> <td style="text-align:right;"> 3.15 </td> <td style="text-align:right;"> 3.440 </td> <td style="text-align:right;"> 17.02 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Valiant </td> <td style="text-align:right;"> 18.1 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 225.0 </td> <td style="text-align:right;"> 105 </td> <td style="text-align:right;"> 2.76 </td> <td style="text-align:right;"> 3.460 </td> <td style="text-align:right;"> 20.22 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Duster 360 </td> <td style="text-align:right;"> 14.3 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 360.0 </td> <td style="text-align:right;"> 245 </td> <td style="text-align:right;"> 3.21 </td> <td style="text-align:right;"> 3.570 </td> <td style="text-align:right;"> 15.84 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 4 </td> </tr> <tr> <td style="text-align:left;"> Merc 240D </td> <td style="text-align:right;"> 24.4 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 146.7 </td> <td style="text-align:right;"> 62 </td> <td style="text-align:right;"> 3.69 </td> <td style="text-align:right;"> 3.190 </td> 
<td style="text-align:right;"> 20.00 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Merc 230 </td> <td style="text-align:right;"> 22.8 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 140.8 </td> <td style="text-align:right;"> 95 </td> <td style="text-align:right;"> 3.92 </td> <td style="text-align:right;"> 3.150 </td> <td style="text-align:right;"> 22.90 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table></div> --- # Dendrogram <img src="Clustering_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- # Real Example 2 - A business to business example with 440 customers of a wholesaler<!--D--> -- - The variables are annual spend in the following 6 categories:<!--D--> -- + Fresh food + Milk + Groceries + Frozen + Detergents/Paper + Delicatessen<!--D--> -- - These data are available on Moodle. --- # Cluster centroids After clustering we get the following cluster means. <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:200px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> Cluster </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Fresh </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Milk </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Grocery </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Frozen </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Detergents_Paper </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Delicassen </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 35941 </td> <td style="text-align:right;"> 6044 </td> <td style="text-align:right;"> 6289 </td> <td style="text-align:right;"> 6714 </td> <td style="text-align:right;"> 1040 </td> <td style="text-align:right;"> 3049 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 8253 </td> <td style="text-align:right;"> 3825 </td> <td style="text-align:right;"> 5280 </td> <td style="text-align:right;"> 2573 </td> <td style="text-align:right;"> 1773 </td> <td style="text-align:right;"> 1137 </td> </tr> <tr> <td style="text-align:left;"> 3 </td> <td style="text-align:right;"> 8000 </td> <td style="text-align:right;"> 18511 </td> <td style="text-align:right;"> 27574 </td> <td style="text-align:right;"> 1997 </td> <td style="text-align:right;"> 12407 </td> <td style="text-align:right;"> 2252 </td> </tr> </tbody> </table></div> The clusters may represent hotels, supermarkets and cafes. --- # Approaches to Clustering - Hierarchical: Path of solutions:<!--D--> -- + Agglomerative: At start every observation is a cluster. Merge the most similar clusters step by step until all observations in one cluster.<!--D--> -- + Divisive: At start all observations in one cluster. Split step by step until each observation is in its own cluster.<!--D--> -- - Non-hierarchical: Choose the number of clusters ex ante. 
No merging or splitting. --- # Our focus - Our main focus will be on agglomerative hierarchical methods.<!--D--> -- - Divisive hierarchical methods are very slow and we do not cover them at all.<!--D--> -- - We consider one example of a non-hierarchical method known as the **k-means** algorithm. --- # Definition of Clustering - Oxford Dictionary: A group of similar things or people positioned or occurring closely together<!--D--> -- - Collins Dictionary: A number of things growing, fastened, or occurring close together<!--D--> -- - Note the importance of closeness or distance. We need two concepts of distance:<!--D--> -- 1. Distance between **observations**. 2. Distance between **clusters**. --- # A distance between clusters - Let `\(\mathcal{A}\)` be a cluster with observations `\(\left\{{\mathbf a}_1, {\mathbf a}_2, \ldots, {\mathbf a}_I \right\}\)` and `\(\mathcal{B}\)` be a cluster with observations `\(\left\{{\mathbf b}_1, {\mathbf b}_2, \ldots, {\mathbf b}_J \right\}\)`. -- - The calligraphic script `\(\mathcal{A}\)` or `\(\mathcal{B}\)` denotes a cluster with possibly more than one point. -- - The bold script `\({\mathbf a}_i\)` or `\({\mathbf b}_j\)` denotes a vector of attributes (e.g. age and income) for each observation. -- - Rather than vectors, it is much easier to think of each observation as a point in a scatterplot. --- # Single Linkage One way of defining the distance between clusters `\(\mathcal{A}\)` and `\(\mathcal{B}\)` is `$$D(\mathcal{A},\mathcal{B})=\underset{i,j}{\min}D({\mathbf a}_i,{\mathbf b}_j)$$` This is called **single linkage** or **nearest neighbour**. --- # Single Linkage <img src="Clustering_files/figure-html/slinkp-1.png" style="display: block; margin: auto;" /> --- # Single Linkage <img src="Clustering_files/figure-html/slink-1.png" style="display: block; margin: auto;" /> --- # Complete Linkage Another way of defining the distance between `\(\mathcal{A}\)` and `\(\mathcal{B}\)` is `$$D(\mathcal{A},\mathcal{B})=\underset{i,j}{\max}D({\mathbf a}_i,{\mathbf b}_j)$$` This is called **complete linkage** or **furthest neighbour**. --- # Complete Linkage <img src="Clustering_files/figure-html/clinkp-1.png" style="display: block; margin: auto;" /> --- # Complete Linkage <img src="Clustering_files/figure-html/clink-1.png" style="display: block; margin: auto;" /> --- # Complete linkage - In the previous example **all** points in the red cluster are within a distance of 160.01 of **all** points in the blue cluster. - This is why it is called **complete** linkage.
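---

# Linkage distances by hand

The two definitions can be checked directly in R. Below is a minimal sketch using two small made-up clusters (the data and object names are illustrative, not from the lecture): single linkage is the smallest pairwise distance between the clusters, complete linkage the largest.

```r
# Two made-up clusters; each row is one observation (e.g. age and income)
A <- matrix(c(1, 2,
              2, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(6, 5,
              7, 7), ncol = 2, byrow = TRUE)

# All pairwise Euclidean distances D(a_i, b_j): rows of A against rows of B
pairdist <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

min(pairdist)  # single linkage distance between the two clusters
max(pairdist)  # complete linkage distance between the two clusters
```

In `hclust` these two rules correspond to `method = "single"` and `method = "complete"`.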
--- # A simple example - Over the next few slides we will go through the entire process of agglomerative clustering<!--D--> -- + We will use Euclidean distance to define the distance between points<!--D--> -- + We will use single linkage to define the distance between clusters<!--D--> -- - There are only five observations and two variables --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /> --- # Agglomerative clustering <img src="Clustering_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> --- # Hierarchical Clustering - 5-cluster solution: A and B and C and D and E -- - 4-cluster solution: \{A,D\} and B and C and E -- - 3-cluster solution: \{A,D\} and \{B,C\} and E -- - 2-cluster solution: \{A,B,C,D\} and E -- - 1-cluster solution: \{A,B,C,D,E\} --- # Dendrogram - The dendrogram is a useful tool for analysing a cluster solution.<!--D--> -- + Observations are on one axis (usually x)<!--D--> -- + The distance between clusters is on the other axis (usually y).<!--D--> -- + From the dendrogram one can see the order in which the clusters are merged. --- # Dendrogram <img src="Clustering_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /> --- # Interpretation of Dendrogram - Think of the distance axis (y-axis) as measuring a 'tolerance level'<!--D--> -- - If the distance between two clusters is within the tolerance they are merged into one cluster.<!--D--> -- - As the tolerance increases more and more clusters are merged, leading to fewer clusters overall.<!--D--> --- # Clustering in R - Clustering in R requires at most 3 steps:<!--D--> -- 1. Standardise the data if they are in different units (using the function `scale`)<!--D--> -- 2. Find the distance between all pairs of observations (using the function `dist`)<!--D--> -- 3. Cluster the data using the function `hclust`<!--D--> -- - Try this with the `mtcars` dataset. Use Euclidean distance and complete linkage. - Store the result of `hclust` in a variable called `CarsCluster`.
--- # Clustering in R ```r data(mtcars) mtcars%>% scale%>% dist%>% hclust(method="complete")-> CarsCluster ``` --- # Dendrogram in R ```r plot(CarsCluster,cex=0.5) ``` <img src="Clustering_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" /> --- # Identifying clusters ```r CarsCluster%>%plot(cex=0.5) CarsCluster%>%rect.hclust(k=2) ``` <img src="Clustering_files/figure-html/unnamed-chunk-22-1.png" style="display: block; margin: auto;" /> --- # Dendrogram in R For an interactive tool try: ```r identify(CarsCluster) ``` Press the escape key when you are finished. --- class: inverse, center, middle #Choosing the number of clusters --- # Choosing clusters - Although hierarchical clustering gives a solution for any number of clusters, ultimately we only want to focus on one of these solutions. -- - There is no *correct* number of clusters. Choosing the number of clusters depends on the context. -- - There are however *poor* choices for the number of clusters. --- # Choosing clusters - Do not choose too many clusters: -- + A firm developing a different marketing strategy for each market segment may not have the resources to develop a large number of unique strategies. -- - Do not choose too few clusters: -- + If you choose the 1-cluster solution there is no point in doing clustering at all. --- # Using dendrogram - One criterion is that the number of clusters is stable over a wide range of tolerance.<!--D--> -- - The plot on the next slide shows a 3 cluster solution.<!--D--> --- # Three cluster solution <img src="Clustering_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" /> --- # Stability - The tolerance for a three cluster solution is about 5.9. -- - If the tolerance is increased *by a very small amount* then we will have a two cluster solution. -- - If the tolerance is decreased *by a very small amount* then we will have a four cluster solution. --- # Two cluster solution <img src="Clustering_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" /> --- # Four cluster solution <img src="Clustering_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" /> --- # Stability - In the previous example -- + The three cluster solution is not stable -- + The two and four cluster solutions are stable -- - In general look for a long stretch of tolerance, over which the number of clusters does not change. --- # Extracting the clusters For a given number of clusters we can create a new variable indicating cluster membership via the `cutree` function. 
```r mem<-cutree(CarsCluster,2) ``` <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:200px; overflow-x: scroll; width:300px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> x </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Mazda RX4 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Mazda RX4 Wag </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Datsun 710 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Hornet 4 Drive </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Hornet Sportabout </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Valiant </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Duster 360 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Merc 240D </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Merc 230 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Merc 280 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Merc 280C </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Merc 450SE </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Merc 450SL </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Merc 450SLC </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Cadillac Fleetwood </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Lincoln Continental </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Chrysler Imperial </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Fiat 128 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Honda Civic </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Toyota Corolla </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Toyota Corona </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Dodge Challenger </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> AMC Javelin </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Camaro Z28 </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Pontiac Firebird </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Fiat X1-9 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Porsche 914-2 </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Lotus Europa </td> <td style="text-align:right;"> 2 </td> </tr> <tr> <td style="text-align:left;"> Ford Pantera L </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Ferrari Dino </td> <td style="text-align:right;"> 1 </td> </tr> <tr> <td style="text-align:left;"> Maserati Bora </td> <td style="text-align:right;"> 1 </td> </tr> <tr> 
<td style="text-align:left;"> Volvo 142E </td> <td style="text-align:right;"> 2 </td> </tr> </tbody> </table></div> --- # Pros and Cons of Single Linkage - Pros: + Single linkage is very easy to understand. + Single linkage is a very fast algorithm.<!--D--> -- - Cons: + Single linkage is very sensitive to single observations which leads to chaining. + Complete linkage avoids this problem and gives more compact clusters with a similar diameter. --- # Chaining <img src="Clustering_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" /> --- # Single Linkage Dendrogram <img src="Clustering_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" /> --- # Single Linkage <img src="Clustering_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" /> --- # Add one observation <img src="Clustering_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" /> --- # New solution <img src="Clustering_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" /> --- # Dendrogram with Chaining <img src="Clustering_files/figure-html/unnamed-chunk-34-1.png" style="display: block; margin: auto;" /> --- # Robustness - In general adding a single observation should not dramatically change the analysis. -- - In this instance the new observation was not even an *outlier*. -- - A term used for such an observation is an *inlier*. -- - Methods that are not affected by single observations are often called **robust**. -- - Let's see if complete linkage is *robust* to the inlier. --- # Complete Linkage <img src="Clustering_files/figure-html/unnamed-chunk-35-1.png" style="display: block; margin: auto;" /> --- # Complete Linkage: Dendrogram <img src="Clustering_files/figure-html/unnamed-chunk-36-1.png" style="display: block; margin: auto;" /> --- # Disadvantages of CL - Complete Linkage overcomes *chaining* and is robust to inliers -- - However, since the distance between clusters only depends on two observations it can still be sensitive to outliers.<!--D--> -- - The following methods are more robust and should be preferred<!--D--> -- + Average Linkage + Centroid Method + Ward’s Method --- # Average Linkage The distance between two clusters can be defined so that it is based on all the pairwise distances between the elements of each cluster. `$$D(\mathcal{A},\mathcal{B})=\frac{1}{|\mathcal{A}||\mathcal{B}|}\sum\limits_{i=1}^{|\mathcal{A}|}\sum\limits_{j=1}^{|\mathcal{B}|}D({\mathbf a}_i,{\mathbf b}_j)$$` Here `\(|\mathcal{A}|\)` is the number of observations in cluster `\(\mathcal{A}\)` and `\(|\mathcal{B}|\)` is the number of observations in cluster `\(\mathcal{B}\)` --- #Average Linkage - Average linkage can be called different things<!--D--> -- + Between groups method. + Unweighted Pair Group Method with Arithmetic mean (UPGMA) --- # Pairwise distances (one obs.) 
<img src="Clustering_files/figure-html/unnamed-chunk-37-1.png" style="display: block; margin: auto;" /> --- # All pairwise distances <img src="Clustering_files/figure-html/unnamed-chunk-38-1.png" style="display: block; margin: auto;" /> --- # Centroid Method - The centroid of a cluster can be defined as the mean of all the points in the cluster.<!--D--> -- - If `\(\mathcal{A}\)` is a cluster containing the observations `\({\mathbf a}\)` then the **centroid** of `\(\mathcal{A}\)` is given by.<!--D--> -- `$${\mathbf{\bar{a}}}=\frac{1}{|\mathcal{A}|}\sum_{\mathbf{a}_i\in\mathcal{A}}\mathbf{a}_i$$`<!--D--> -- - The distance between two clusters can then be defined as the distance between the respective centroids. --- # Vector mean - Recall that `\(\mathbf{a}_i\)` is a vector of attributes, e.g income and age. -- - In this case `\(\bar{\mathbf{a}}\)` is also a vector of attributes. -- - Each element of `\(\bar{\mathbf{a}}\)` is the mean of a different attribute, e.g. mean income, mean age. --- # Centroid method <img src="Clustering_files/figure-html/unnamed-chunk-39-1.png" style="display: block; margin: auto;" /> --- # Centroid method <img src="Clustering_files/figure-html/unnamed-chunk-40-1.png" style="display: block; margin: auto;" /> --- # Average Linkage v Centroid - Consider an example with one variable (although everything works with vectors too).<!--D--> -- - Suppose we have the clusters `\(\mathcal{A}=\left\{0,2\right\}\)` and `\(\mathcal{B}=\left\{3,5\right\}\)`<!--D--> -- - Find the distance `\(\mathcal{A}\)` and `\(\mathcal{B}\)` using<!--D--> -- + Average Linkage + Centroid Method --- # Average Linkage - Must find distances between all pairs of observations<!--D--> -- + `\(D(a_1,b_1)=3\)` + `\(D(a_1,b_2)=5\)` + `\(D(a_2,b_1)=1\)` + `\(D(a_2,b_2)=3\)`<!--D--> -- - Averaging these, the distance is 3. --- # Centroid method - First find centroids<!--D--> -- + `\(\bar{a}=1\)` + `\(\bar{b}=4\)`<!--D--> -- - The distance is 3.<!--D--> -- - Here both methods give the same answer but when vectors are used instead they do not give the same answer in general. --- # Average Linkage v Centroid - In average linkage<!--D--> -- 1. Compute the distances between pairs of observations 2. Average these distances<!--D--> -- - In the centroid method<!--D--> -- 1. Average the observations to obtain the centroid of each cluster. 2. Find the distance between centroids --- # Ward's method - All methods so far, merge two clusters when the distance between them is small.<!--D--> -- - Ward’s method merges two clusters to minimise within cluster variance.<!--D--> -- - Two variations implemented in R.<!--D--> -- + `Ward.D2` is the same as the original Ward paper. + `Ward.D` is actually based on a mistake but can still work quite well. --- # Within Cluster Variance - The within-cluster variance for a cluster `\(\mathcal{A}\)` is defined as `$$\mbox{V}_{\mbox{w}}(\mathcal{A})=\frac{1}{|\mathcal{A}|-1}S(\mathcal{A})$$` where `$$S(\mathcal{A})=\sum_{\mathbf{a}_i\in\mathcal{A}}\left[\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)'\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)\right]$$` --- # Vector notation - The term `\(S(\mathcal{A})=\sum\limits_{\mathbf{a}_i\in\mathcal{A}}\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)'\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)\)` uses vector notation, but the idea is simple. -- - Take the difference of each attribute from its mean (e.g. income, age, etc.) -- - Then square them and add together over attributes **and** observations. 
-- - The within-cluster variance is a total variance across all attributes. --- # Ward's algorithm - At each step we must merge two clusters to form a single cluster. -- - Suppose we pick clusters `\(\mathcal{A}\)` and `\(\mathcal{B}\)` to form a new cluster `\(\mathcal{C}\)`. -- - Ward's algorithm chooses `\(\mathcal{A}\)` and `\(\mathcal{B}\)` so that `\(V_{W}(\mathcal{C})\)` is as small as possible. --- class: middle, center, inverse # Non-hierarchical Clustering --- # Non-hierarchical Clustering - In some analyses the exact number of clusters may be known. -- - If so, non-hierarchical clustering may be used. -- - Perhaps the most widely used non-hierarchical method is k-means clustering. --- # k-means - In general `\(k\)`-means seeks to find `\(k\)` clusters. -- - The following condition must be satisfied: -- + Each point must be closer to its **own** cluster centroid than to the centroid of any other cluster. -- - Knowing the partition into clusters determines the means. -- - Knowing the means determines the clusters. --- # Optimality - The objective of k-means clustering is to find centroids in a way that minimises the within-cluster sum of squares. -- - Let `\({\mathbf C}=\left\{\mathcal{C}_1,\ldots,\mathcal{C}_k\right\}\)` be a partitioning of all points into `\(k\)` clusters. -- - The objective of k-means is to find `$$\underset{{\mathbf C}}{\mbox{argmin}}\sum\limits_{h=1}^k S(\mathcal{C}_h)$$` --- # NP-hard - It is an example of an NP-hard problem. -- - The bad news is that NP-hard problems cannot be easily solved by computers. -- - The good news is that your credit card security also relies on the fact that computers cannot easily solve certain hard problems. --- # Heuristic - Fortunately there are algorithms that provide a reasonably good solution to the k-means problem. -- - In some cases they may provide the exact solution, although there are no guarantees. -- - We will now cover **Lloyd's algorithm**, which provides good intuition into the k-means problem. -- - By default, R implements the more sophisticated (and complicated) **Hartigan-Wong** algorithm. --- # Lloyd's algorithm 1. Choose initial centroids (possibly at random).<!--D--> -- 2. Allocate each observation to the cluster with the nearest centroid.<!--D--> -- 3. Re-compute the centroids as the mean of all observations in the cluster.<!--D--> -- 4.
Repeat steps 2 and 3 until convergence --- # Raw Data <img src="Clustering_files/figure-html/unnamed-chunk-41-1.png" style="display: block; margin: auto;" /> --- # Initial Centroids <img src="Clustering_files/figure-html/unnamed-chunk-42-1.png" style="display: block; margin: auto;" /> --- # Initial Allocation <img src="Clustering_files/figure-html/unnamed-chunk-43-1.png" style="display: block; margin: auto;" /> --- # Re-compute Centroids <img src="Clustering_files/figure-html/unnamed-chunk-44-1.png" style="display: block; margin: auto;" /> --- # Reallocate <img src="Clustering_files/figure-html/unnamed-chunk-45-1.png" style="display: block; margin: auto;" /> --- # Reallocate <img src="Clustering_files/figure-html/unnamed-chunk-46-1.png" style="display: block; margin: auto;" /> --- # Recompute Centroids <img src="Clustering_files/figure-html/unnamed-chunk-47-1.png" style="display: block; margin: auto;" /> --- # Reallocate <img src="Clustering_files/figure-html/unnamed-chunk-48-1.png" style="display: block; margin: auto;" /> --- # Reallocate <img src="Clustering_files/figure-html/unnamed-chunk-49-1.png" style="display: block; margin: auto;" /> --- # Stable solution <img src="Clustering_files/figure-html/unnamed-chunk-50-1.png" style="display: block; margin: auto;" /> --- # Wholesaler Data - Recall the Wholesaler data from earlier in the lecture<!--D--> -- - The variables are annual spend in 6 categories.<!--D--> -- - Should the data be standardised?<!--D--> -- - Try to carry out k means clustering using the R function `kmeans`<!--D--> -- - Find a solution with 3 clusters. --- # k-means in R To do a three cluster solution ```r WholesaleCluster<-kmeans(Wholesale,3) ``` If the data are in a data.frame you may need to select the numeric variables. --- # R output - The result of the R function kmeans will be a list containing several entries. The most interesting are<!--D--> -- + A variable indicating cluster membership is given in `cluster`<!--D--> -- + The centroids for each cluster are given in `centers`<!--D--> -- + The number of observations in each cluster is given by `size`<!--D--> -- + The cluster centroids can be useful for profiling the clusters. 
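---

# Extracting the k-means output

As a short sketch of the components listed above (reusing the `WholesaleCluster` object created on the k-means in R slide, and assuming `Wholesale` is a data frame; the column name `Cluster` is just an illustrative choice):

```r
WholesaleCluster$size       # number of observations in each cluster
WholesaleCluster$centers    # cluster centroids, useful for profiling
head(WholesaleCluster$cluster)  # cluster membership of the first few customers

# Attach the memberships to the data for further profiling
Wholesale$Cluster <- WholesaleCluster$cluster
```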
--- # Cluster Centroids <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:300px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Fresh </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Milk </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Grocery </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Frozen </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Detergents_Paper </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Delicassen </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 8000.04 </td> <td style="text-align:right;"> 18511.420 </td> <td style="text-align:right;"> 27573.900 </td> <td style="text-align:right;"> 1996.680 </td> <td style="text-align:right;"> 12407.360 </td> <td style="text-align:right;"> 2252.020 </td> </tr> <tr> <td style="text-align:right;"> 35941.40 </td> <td style="text-align:right;"> 6044.450 </td> <td style="text-align:right;"> 6288.617 </td> <td style="text-align:right;"> 6713.967 </td> <td style="text-align:right;"> 1039.667 </td> <td style="text-align:right;"> 3049.467 </td> </tr> <tr> <td style="text-align:right;"> 8253.47 </td> <td style="text-align:right;"> 3824.603 </td> <td style="text-align:right;"> 5280.455 </td> <td style="text-align:right;"> 2572.661 </td> <td style="text-align:right;"> 1773.058 </td> <td style="text-align:right;"> 1137.497 </td> </tr> </tbody> </table></div> --- # Robustness Check Since values are sensitive to starting values, we can run the algorithm with many different starting values using the `nstart` option ```r WholesaleCluster<-kmeans(Wholesale,3,nstart = 25) ``` <div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:200px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Fresh </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Milk </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Grocery </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Frozen </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Detergents_Paper </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Delicassen </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 35941.40 </td> <td style="text-align:right;"> 6044.450 </td> <td style="text-align:right;"> 6288.617 </td> <td style="text-align:right;"> 6713.967 </td> <td style="text-align:right;"> 1039.667 </td> <td style="text-align:right;"> 3049.467 </td> </tr> <tr> <td style="text-align:right;"> 8000.04 </td> <td style="text-align:right;"> 18511.420 </td> <td style="text-align:right;"> 27573.900 </td> <td style="text-align:right;"> 1996.680 </td> <td style="text-align:right;"> 12407.360 </td> <td style="text-align:right;"> 2252.020 </td> </tr> <tr> <td style="text-align:right;"> 8253.47 </td> <td style="text-align:right;"> 3824.603 </td> <td style="text-align:right;"> 5280.455 </td> <td style="text-align:right;"> 2572.661 </td> <td style="text-align:right;"> 
1773.058 </td> <td style="text-align:right;"> 1137.497 </td> </tr> </tbody> </table></div> --- # Label switching - Two slides back the second cluster had the highest spend on fresh food. -- - One slide back the first cluster had the highest spend on fresh food. -- - The centroids were identical; they were just flipped around. This is called **label switching**. -- - It does not matter which cluster is labelled first, second or third. The means are what matter. --- # Number of clusters - The motivation of k-means clustering is that the number of clusters is already known. -- - In principle different choices of `\(k\)` can be used and compared to one another. -- - However, unlike hierarchical clustering, these different solutions can contradict one another. --- # The meaning of non-hierarchical - Consider the two cluster solution (Solution A) and three cluster solution (Solution B) for **hierarchical** clustering.<!--D--> -- + If two observations are in the same cluster in Solution B then they will be in the same cluster in Solution A<!--D--> -- - The same is not true for **non-hierarchical** clustering, including k-means clustering. --- # Hierarchical Clustering Together we will use Ward's method to do hierarchical clustering on the Wholesale data and get the cluster membership from the two and three cluster solutions. Then you can try the same for k-means. --- # Solution ```r Wholesale%>% dist%>% hclust(method='ward.D2')->hiercl cl2<-cutree(hiercl,2) cl3<-cutree(hiercl,3) table(cl2,cl3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 261 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 45 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 134 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table> --- # Same exercise for k-means ```r km2<-kmeans(Wholesale,2) kmcl2<-km2$cluster km3<-kmeans(Wholesale,3) kmcl3<-km3$cluster table(kmcl2,kmcl3) ``` <table> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 59 </td> <td style="text-align:right;"> 6 </td> </tr> <tr> <td style="text-align:left;"> 2 </td> <td style="text-align:right;"> 330 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 44 </td> </tr> </tbody> </table> --- # Non-hierarchical - Consider the observations in Cluster 3 when `\(k=3\)`. When we go from `\(k=3\)` to `\(k=2\)`<!--D--> -- + There are 6 of these observations that go to the new cluster 1.<!--D--> -- + The remaining 44 observations go to the new cluster 2.<!--D--> -- - Notice that there is some label switching as well.
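---

# Comparing methods with a cross-tab

The same cross-tabulation idea can be used to compare two different methods rather than two values of `\(k\)`. A brief sketch (reusing `cl3` and `kmcl3` from the previous slides; the exact counts depend on the k-means starting values):

```r
# Ward three-cluster memberships against k-means three-cluster memberships
table(cl3, kmcl3)
```

If most observations fall into a few cells, the two methods are putting largely the same customers together, up to label switching.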
--- class: inverse, middle, center # Comparing Cluster solutions --- # Comparing Cluster solutions - A challenging aspect of cluster analysis is that it is difficult to evaluate a cluster solution.<!--D--> -- + In forecasting we can compare forecasts to outcomes.<!--D--> -- + In regression we can look at goodness of fit.<!--D--> -- - There is also very little theory to guide us.<!--D--> -- + In regression we know least squares is BLUE under certain assumptions.<!--D--> -- - How do we choose a clustering algorithm? --- # Choosing a method - There is no *ideal* method for hierarchical clustering. -- - A good strategy is to try a few different methods. -- - If there is a clear structure in the data then most methods will give similar results. - However, it is not unusual to find one method yielding very different results. -- - If all methods give vastly different results then perhaps there are no clear clusters in the data. --- # Robustness - We can check whether a clustering solution is robust to different algorithms.<!--D--> -- - For example, if the centroid method, average linkage, Ward's method and k-means all give similar clusters then we can be confident that the clusters are truly a feature of the data.<!--D--> -- - One way to evaluate this is to look at the Rand Index. --- # Rand Index - Suppose we have two cluster solutions, Solution A and Solution B.<!--D--> -- - Pick two observations `\({\mathbf x}\)` and `\({\mathbf y}\)` at random. There are four possible scenarios: -- 1. `\({\mathbf x}\)` and `\({\mathbf y}\)` are in the same cluster in Solution A and in the same cluster in Solution B<!--D--> -- 2. `\({\mathbf x}\)` and `\({\mathbf y}\)` are in different clusters in Solution A and in different clusters in Solution B<!--D--> -- 3. `\({\mathbf x}\)` and `\({\mathbf y}\)` are in the same cluster in Solution A but in different clusters in Solution B<!--D--> -- 4. `\({\mathbf x}\)` and `\({\mathbf y}\)` are in different clusters in Solution A but in the same cluster in Solution B --- # Rand Index - Scenario 1 and scenario 2 both suggest that the cluster solutions are in **agreement**<!--D--> -- - Scenario 3 and scenario 4 both suggest that the cluster solutions are in **disagreement**<!--D--> -- - The **Rand Index** is the probability that two observations picked at random are in agreement (scenario 1 or 2).<!--D--> -- - The **Rand Index** lies between 0 and 1, and higher numbers indicate stronger agreement. --- # Adjusted Rand Index - Even if observations are clustered at random, there will still be some agreement due to chance.<!--D--> -- - The adjusted Rand Index is designed to be 0 if the level of agreement is equivalent to the case where clustering is done at random.<!--D--> -- - It is still only equal to 1 if the two clustering solutions are in perfect agreement.<!--D--> -- - The adjusted Rand Index can be computed using the `adjustedRandIndex` function in the package `mclust`. --- # Conclusion - There are many methods for clustering. -- - For this reason a cluster analysis should be carried out carefully and transparently. -- - Although we have focused on algorithms in this lecture, remember that the objective of cluster analysis is to explore the data. -- - As such, remember to profile the clusters and to provide insight into what these clusters may represent.
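---

# Appendix: Adjusted Rand Index in R

A brief sketch of the robustness check described above, assuming the `mclust` package is installed and reusing the Ward (`cl3`) and k-means (`kmcl3`) three-cluster memberships for the Wholesale data:

```r
library(mclust)

# Agreement between the hierarchical (Ward) and k-means three-cluster solutions.
# Values near 1 indicate strong agreement (up to label switching);
# values near 0 indicate no more agreement than random clustering.
adjustedRandIndex(cl3, kmcl3)
```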