class: center, middle, inverse, title-slide

.title[
# Week 11: Clustering and the Dendrogram
]
.author[
### Visual Data Analytics
]
.institute[
### University of Sydney
]

---

# Outline

- Distance
- Single Linkage
- Other Hierarchical Clustering
- Dendrogram

---

# Motivation

- We can profile an individual according to their attributes.
- Are two individuals similar?
- Can we group two individuals together?
- How can we visualise this?
- The method is hierarchical clustering and the visualisation is the dendrogram.
- The ideas we cover are useful in marketing and other business problems.

---

class: middle, center, inverse

# Distance

---

# Why distance?

- Many problems involve thinking about how *similar* or dissimilar two observations are. For example:<!--D-->

--

+ We may use the same marketing strategy for *similar* demographic groups.
+ We may lend money to applicants who are *similar* to those who pay their debts back.<!--D-->

--

- Arguably the most important concept in data analysis is *distance*.

---

# Simple example

- Consider 3 individuals:<!--D-->

--

+ Mr Orange: 37 years of age, earns $75k a year
+ Mr Red: 31 years of age, earns $67k a year
+ Mr Blue: 30 years of age, earns $68k a year<!--D-->

--

- Which two are the most similar?

---

# On a scatterplot

<img src="11Clustering_files/figure-html/scatter-1.png" style="display: block; margin: auto;" />

---

# Distance as a number

- It is easy to think about three individuals, but what if there are thousands of individuals?<!--D-->

--

+ In this case it will be useful to attach a number to the distance between each pair of individuals.<!--D-->

--

+ We will do this with a simple application of Pythagoras' theorem.

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pytha-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythb-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythc-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythd-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythe-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythf-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythg-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythh-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="11Clustering_files/figure-html/pythi-1.png" style="display: block; margin: auto;" />

---

# Euclidean distance

- In general there are more than two variables.<!--D-->

--

- Is there a way to apply our intuition in two dimensions to higher dimensions?<!--D-->

--

+ Pythagoras' theorem can be *generalised* to higher dimensions.<!--D-->

--

+ This results in a concept of distance called *Euclidean distance*.

---

# Euclidean distance

We measure `\(p\)` variables for two observations: `\(x_{j}\)` is the measurement of variable `\(j\)` for observation `\({\mathbf x}\)`, and `\(y_{j}\)` is the measurement of variable `\(j\)` for observation `\({\mathbf y}\)`.
The *Euclidean* distance between `\({\mathbf x}\)` and `\({\mathbf y}\)` is:

`$$D\left({\mathbf x},{\mathbf y}\right)=\sqrt{\sum\limits_{j=1}^p \left(x_{j}-y_{j}\right)^2}$$`

---

# Distance and Standardising data

- We must be careful about the units of measurement.<!--D-->

--

- Euclidean distance will change when variables are measured in *different units*.<!--D-->

--

- For this reason, it is common to calculate distance after *standardising* the data.<!--D-->

--

- If the variables are all measured in the same units, then this standardisation is unnecessary.<!--D-->

---

# Other kinds of distance

- We will nearly always use Euclidean distance in this unit; however, there are other ways of understanding distance.
- These include distance measures for categorical data and even for strings of text!
- While we will not cover these, the methods of hierarchical clustering we cover will work as long as we have some way of defining distance between individuals.

---

# Why is distance useful?

<img src="img/CollaborativeFiltering.jpeg" width="1333" style="display: block; margin: auto;" />

Figure by Mohamed Ben Ellefi

---

# Recommender Systems

- Famous recommender systems are used by Amazon, Netflix and Alibaba, amongst others.
- These systems are usually a hybrid of
  - Collaborative Filtering
  - Content-based Filtering
- The method we discuss here is more specifically called memory-based collaborative filtering.
- Being able to put customers into similar groups is important.

---

class: middle, center, inverse

# Hierarchical Clustering

---

# Age v Income

<img src="11Clustering_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Obvious clusters

<img src="11Clustering_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

# Summary

- When there are more than two variables, just looking at a scatterplot doesn't work.<!--D-->

--

- Instead, algorithms can be used to find the clusters in a sensible way, even in high dimensions.

---

# Definition of Clustering

- Oxford Dictionary: A group of similar things or people positioned or occurring closely together<!--D-->

--

- Collins Dictionary: A number of things growing, fastened, or occurring close together<!--D-->

--

- Note the importance of closeness or distance. We need two concepts of distance:<!--D-->

--

1. Distance between **observations**.
2. Distance between **clusters**.

---

# A distance between clusters

- Let `\(\mathcal{A}\)` be a cluster with observations `\(\left\{{\mathbf a}_1, {\mathbf a}_2, \ldots, {\mathbf a}_I \right\}\)` and `\(\mathcal{B}\)` be a cluster with observations `\(\left\{{\mathbf b}_1, {\mathbf b}_2, \ldots, {\mathbf b}_J \right\}\)`.

--

- The calligraphic script `\(\mathcal{A}\)` or `\(\mathcal{B}\)` denotes a cluster with possibly more than one point.

--

- The bold script `\({\mathbf a}_i\)` or `\({\mathbf b}_j\)` denotes a vector of attributes (e.g. age and income) for each observation.

--

- Rather than vectors, it is much easier to think of each observation as a point in a scatterplot.

---

# Single Linkage

One way of defining the distance between clusters `\(\mathcal{A}\)` and `\(\mathcal{B}\)` is

`$$D(\mathcal{A},\mathcal{B})=\underset{i,j}{\min}D({\mathbf a}_i,{\mathbf b}_j)$$`

This is called **single linkage** or **nearest neighbour**.
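---

# Checking distances in Python

A minimal sketch (not part of the original example) of both calculations, using made-up ages and incomes. It assumes only NumPy and SciPy: `cdist` gives all pairwise Euclidean distances, and the minimum of those is the single linkage distance.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two observations (age, income in $000s) -- illustrative values only
x = np.array([37, 75])
y = np.array([31, 67])

# Euclidean distance: square root of the sum of squared differences
print(np.sqrt(np.sum((x - y) ** 2)))   # same as np.linalg.norm(x - y)

# Single linkage between two small clusters A and B:
# the minimum over all pairwise distances
A = np.array([[37, 75], [31, 67]])
B = np.array([[30, 68], [45, 90]])
print(cdist(A, B).min())
```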
---

# Single Linkage

<img src="11Clustering_files/figure-html/slinkp-1.png" style="display: block; margin: auto;" />

---

# Single Linkage

<img src="11Clustering_files/figure-html/slink-1.png" style="display: block; margin: auto;" />

---

# Complete Linkage

Another way of defining the distance between `\(\mathcal{A}\)` and `\(\mathcal{B}\)` is

`$$D(\mathcal{A},\mathcal{B})=\underset{i,j}{\max}D({\mathbf a}_i,{\mathbf b}_j)$$`

This is called **complete linkage** or **furthest neighbour**.

---

# Complete Linkage

<img src="11Clustering_files/figure-html/clinkp-1.png" style="display: block; margin: auto;" />

---

# Complete Linkage

<img src="11Clustering_files/figure-html/clink-1.png" style="display: block; margin: auto;" />

---

# Complete linkage

- In the previous example **all** points in the red cluster are within a distance of 160.01 of **all** points in the blue cluster.
- This is why it is called **complete** linkage.

---

# A simple example

- Over the next couple of slides we will go through the entire process of agglomerative clustering.<!--D-->

--

+ We will use Euclidean distance to define the distance between points.<!--D-->

--

+ We will use single linkage to define the distance between clusters.<!--D-->

--

- There are only five observations and two variables.

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-8-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />

---

# Agglomerative clustering

<img src="11Clustering_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

# Hierarchical Clustering

- 5-cluster solution: A and B and C and D and E

--

- 4-cluster solution: \{A, D\} and B and C and E

--

- 3-cluster solution: \{A, D\} and \{B, C\} and E

--

- 2-cluster solution: \{A, B, C, D\} and E

--

- 1-cluster solution: \{A, B, C, D, E\}

---

# Dendrogram

- The dendrogram is a useful tool for analysing a cluster solution.<!--D-->

--

+ Observations are on one axis (usually x).<!--D-->

--

+ The distance between clusters is on the other axis (usually y).<!--D-->

--

+ From the dendrogram one can see the order in which the clusters are merged.
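---

# A toy dendrogram in Python

The five-observation walk-through can be reproduced with SciPy. This is a sketch rather than the code behind the previous slides: the coordinates below are invented, and `linkage` with `method='single'` implements the single linkage merging described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Five observations with two variables each -- made-up coordinates
X = np.array([[1.0, 1.0],    # A
              [4.0, 5.0],    # B
              [5.0, 5.5],    # C
              [1.5, 1.5],    # D
              [9.0, 2.0]])   # E

Z = linkage(X, method='single')      # single linkage on Euclidean distances

dendrogram(Z, labels=list('ABCDE'))  # merge history shown as a dendrogram
plt.show()

print(fcluster(Z, t=3, criterion='maxclust'))  # e.g. a 3-cluster solution
```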
---

# Dendrogram

<img src="11Clustering_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />

---

# Interpretation of Dendrogram

- Think of the distance axis (the y-axis) as measuring a 'tolerance level'.<!--D-->

--

- If the distance between two clusters is within the tolerance, they are merged into one cluster.<!--D-->

--

- As the tolerance increases, more and more clusters are merged, leaving fewer clusters overall.<!--D-->

---

# A real example using Python

- We will use the mpg dataset from Seaborn.
- Observations are cars.
- Variables are related to engine size, fuel efficiency, etc.
- We will make the car name the index.
- We will remove non-numeric variables (origin and name).
- We will drop observations with missing values.

---

# Data processing

```python
import seaborn as sns

cars = sns.load_dataset('mpg')
cars73 = cars[cars['model_year'] == 73]
cars73.index = cars73['name']
carsnum = cars73.iloc[:, 0:6]
carsnum = carsnum.dropna(how='any')
carsnum
```

```
##                                mpg  cylinders  ...  weight  acceleration
## name                                           ...
## buick century 350             13.0         8  ...    4100          13.0
## amc matador                   14.0         8  ...    3672          11.5
## chevrolet malibu              13.0         8  ...    3988          13.0
## ford gran torino              14.0         8  ...    4042          14.5
## dodge coronet custom          15.0         8  ...    3777          12.5
## mercury marquis brougham      12.0         8  ...    4952          11.5
## chevrolet caprice classic     13.0         8  ...    4464          12.0
## ford ltd                      13.0         8  ...    4363          13.0
## plymouth fury gran sedan      14.0         8  ...    4237          14.5
## chrysler new yorker brougham  13.0         8  ...    4735          11.0
## buick electra 225 custom      12.0         8  ...    4951          11.0
## amc ambassador brougham       13.0         8  ...    3821          11.0
## plymouth valiant              18.0         6  ...    3121          16.5
## chevrolet nova custom         16.0         6  ...    3278          18.0
## amc hornet                    18.0         6  ...    2945          16.0
## ford maverick                 18.0         6  ...    3021          16.5
## plymouth duster               23.0         6  ...    2904          16.0
## volkswagen super beetle       26.0         4  ...    1950          21.0
## chevrolet impala              11.0         8  ...    4997          14.0
## ford country                  12.0         8  ...    4906          12.5
## plymouth custom suburb        13.0         8  ...    4654          13.0
## oldsmobile vista cruiser      12.0         8  ...    4499          12.5
## amc gremlin                   18.0         6  ...    2789          15.0
## toyota carina                 20.0         4  ...    2279          19.0
## chevrolet vega                21.0         4  ...    2401          19.5
## datsun 610                    22.0         4  ...    2379          16.5
## maxda rx3                     18.0         3  ...    2124          13.5
## ford pinto                    19.0         4  ...    2310          18.5
## mercury capri v6              21.0         6  ...    2472          14.0
## fiat 124 sport coupe          26.0         4  ...    2265          15.5
## chevrolet monte carlo s       15.0         8  ...    4082          13.0
## pontiac grand prix            16.0         8  ...    4278           9.5
## fiat 128                      29.0         4  ...    1867          19.5
## opel manta                    24.0         4  ...    2158          15.5
## audi 100ls                    20.0         4  ...    2582          14.0
## volvo 144ea                   19.0         4  ...    2868          15.5
## dodge dart custom             15.0         8  ...    3399          11.0
## saab 99le                     24.0         4  ...    2660          14.0
## toyota mark ii                20.0         6  ...    2807          13.5
## oldsmobile omega              11.0         8  ...    3664          11.0
##
## [40 rows x 6 columns]
```

---

# Plot

```python
sns.clustermap(carsnum, standard_scale=1)
```

<img src="11Clustering_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" />

---

# What do we see?

- Notice there are two dendrograms.
- One groups observations together.
- The other groups variables together.
- The inside is a heatmap of the data matrix.
- Cars are most easily grouped by the number of cylinders.
- There are also groupings among the variables.
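---

# Varying the clustermap

`clustermap` passes its clustering options through to SciPy, so the linkage method and distance metric can be changed directly. The call below is a sketch of that idea (not from the original slides); it assumes the `carsnum` data frame and the `sns` import from the earlier slides.

```python
# Same data as before, but clustered with complete linkage.
# standard_scale=1 rescales each column (variable) as in the earlier plot.
sns.clustermap(carsnum,
               standard_scale=1,
               method='complete',   # linkage method passed through to SciPy
               metric='euclidean')  # distance between observations
```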
---

# Dendrogram only

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np

plt.figure()
Z = linkage(carsnum, 'average')
# Assigning the result to a variable suppresses the long dictionary
# that dendrogram() would otherwise print in the slide output.
dn = dendrogram(Z, orientation='right', leaf_font_size=8,
                labels=carsnum.index)
```
---

# Dendrogram

```python
plt.show()
```

<img src="11Clustering_files/figure-html/unnamed-chunk-21-3.png" width="60%" style="display: block; margin: auto;" />

---

# More about the code

- The hierarchical clustering is done using the scipy package.
- Information can also be pulled out of the object created by this package.
- Note that we used average linkage here rather than single linkage.
- We will now see why.

---

class: middle, center, inverse

# Other clustering methods

---

# Pros and Cons of Single Linkage

- Pros:
  + Single linkage is very easy to understand.
  + Single linkage is a very fast algorithm.<!--D-->

--

- Cons:
  + Single linkage is very sensitive to single observations, which leads to chaining.
  + Complete linkage avoids this problem and gives more compact clusters with a similar diameter.

---

# Chaining

<img src="11Clustering_files/figure-html/unnamed-chunk-22-5.png" style="display: block; margin: auto;" />

---

# Single Linkage Dendrogram

<img src="11Clustering_files/figure-html/unnamed-chunk-23-1.png" style="display: block; margin: auto;" />

---

# Single Linkage

<img src="11Clustering_files/figure-html/unnamed-chunk-24-1.png" style="display: block; margin: auto;" />

---

# Add one observation

<img src="11Clustering_files/figure-html/unnamed-chunk-25-1.png" style="display: block; margin: auto;" />

---

# New solution

<img src="11Clustering_files/figure-html/unnamed-chunk-26-1.png" style="display: block; margin: auto;" />

---

# Dendrogram with Chaining

<img src="11Clustering_files/figure-html/unnamed-chunk-27-1.png" style="display: block; margin: auto;" />

---

# Robustness

- In general, adding a single observation should not dramatically change the analysis.

--

- In this instance the new observation was not even an *outlier*.

--

- A term used for such an observation is an *inlier*.

--

- Methods that are not affected by single observations are often called **robust**.

--

- Let's see if complete linkage is *robust* to the inlier.
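---

# Experimenting with an inlier

A sketch for experimenting with this yourself (the data are simulated, not the example in the slides): build two well-separated groups, add one point in between, and compare the two-cluster solutions from single and complete linkage.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Two well-separated groups plus a single point in between them
left   = rng.normal(loc=[0, 0],  scale=0.5, size=(20, 2))
right  = rng.normal(loc=[10, 0], scale=0.5, size=(20, 2))
inlier = np.array([[5.0, 0.0]])
X = np.vstack([left, right, inlier])

# Two-cluster solutions under single and complete linkage
for method in ['single', 'complete']:
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, np.bincount(labels)[1:])   # sizes of the two clusters
```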
---

# Complete Linkage

<img src="11Clustering_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto;" />

---

# Complete Linkage: Dendrogram

<img src="11Clustering_files/figure-html/unnamed-chunk-29-1.png" style="display: block; margin: auto;" />

---

# Disadvantages of CL

- Complete linkage overcomes *chaining* and is robust to inliers.

--

- However, since the distance between clusters only depends on two observations, it can still be sensitive to outliers.<!--D-->

--

- The following methods are more robust and should be preferred:<!--D-->

--

+ Average Linkage
+ Centroid Method
+ Ward's Method

---

# Average Linkage

The distance between two clusters can be defined so that it is based on all the pairwise distances between the elements of each cluster.

`$$D(\mathcal{A},\mathcal{B})=\frac{1}{|\mathcal{A}||\mathcal{B}|}\sum\limits_{i=1}^{|\mathcal{A}|}\sum\limits_{j=1}^{|\mathcal{B}|}D({\mathbf a}_i,{\mathbf b}_j)$$`

Here `\(|\mathcal{A}|\)` is the number of observations in cluster `\(\mathcal{A}\)` and `\(|\mathcal{B}|\)` is the number of observations in cluster `\(\mathcal{B}\)`.

---

# Average Linkage

- Average linkage goes by several different names:<!--D-->

--

+ Between-groups method.
+ Unweighted Pair Group Method with Arithmetic mean (UPGMA).

---

# Pairwise distances (one obs.)

<img src="11Clustering_files/figure-html/unnamed-chunk-30-1.png" style="display: block; margin: auto;" />

---

# All pairwise distances

<img src="11Clustering_files/figure-html/unnamed-chunk-31-1.png" style="display: block; margin: auto;" />

---

# Centroid Method

- The centroid of a cluster can be defined as the mean of all the points in the cluster.<!--D-->

--

- If `\(\mathcal{A}\)` is a cluster containing the observations `\({\mathbf a}_i\)`, then the **centroid** of `\(\mathcal{A}\)` is given by:<!--D-->

--

`$${\mathbf{\bar{a}}}=\frac{1}{|\mathcal{A}|}\sum_{\mathbf{a}_i\in\mathcal{A}}\mathbf{a}_i$$`<!--D-->

--

- The distance between two clusters can then be defined as the distance between the respective centroids.

---

# Vector mean

- Recall that `\(\mathbf{a}_i\)` is a vector of attributes, e.g. income and age.

--

- In this case `\(\bar{\mathbf{a}}\)` is also a vector of attributes.

--

- Each element of `\(\bar{\mathbf{a}}\)` is the mean of a different attribute, e.g. mean income, mean age.

---

# Centroid method

<img src="11Clustering_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---

# Centroid method

<img src="11Clustering_files/figure-html/unnamed-chunk-33-1.png" style="display: block; margin: auto;" />

---

# Average Linkage v Centroid

- Consider an example with one variable (although everything works with vectors too).<!--D-->

--

- Suppose we have the clusters `\(\mathcal{A}=\left\{0,2\right\}\)` and `\(\mathcal{B}=\left\{3,5\right\}\)`.<!--D-->

--

- Find the distance between `\(\mathcal{A}\)` and `\(\mathcal{B}\)` using<!--D-->

--

+ Average Linkage
+ Centroid Method

---

# Average Linkage

- We must find the distances between all pairs of observations:<!--D-->

--

+ `\(D(a_1,b_1)=3\)`
+ `\(D(a_1,b_2)=5\)`
+ `\(D(a_2,b_1)=1\)`
+ `\(D(a_2,b_2)=3\)`<!--D-->

--

- Averaging these, the distance is 3.

---

# Centroid method

- First find the centroids:<!--D-->

--

+ `\(\bar{a}=1\)`
+ `\(\bar{b}=4\)`<!--D-->

--

- The distance is 3.<!--D-->

--

- Here both methods give the same answer, but with vectors of more than one variable they do not give the same answer in general.

---

# Average Linkage v Centroid

- In average linkage<!--D-->

--

1. Compute the distances between all pairs of observations.
2. Average these distances.<!--D-->
--

- In the centroid method<!--D-->

--

1. Average the observations to obtain the centroid of each cluster.
2. Find the distance between the centroids.

---

# Ward's method

- All the methods so far merge two clusters when the distance between them is small.<!--D-->

--

- Ward's method merges two clusters to minimise the within-cluster variance.<!--D-->

---

# Within-Cluster Variance

- The within-cluster variance for a cluster `\(\mathcal{A}\)` is defined as

`$$\mbox{V}_{\mbox{w}}(\mathcal{A})=\frac{1}{|\mathcal{A}|-1}S(\mathcal{A})$$`

where

`$$S(\mathcal{A})=\sum_{\mathbf{a}_i\in\mathcal{A}}\left[\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)'\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)\right]$$`

---

# Vector notation

- The term `\(S(\mathcal{A})=\sum\limits_{\mathbf{a}_i\in\mathcal{A}}\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)'\left(\mathbf{a}_i-{\mathbf{\bar{a}}}\right)\)` uses vector notation, but the idea is simple.

--

- Take the difference of each attribute from its mean (e.g. income, age, etc.).

--

- Then square these differences and add them up over attributes **and** observations.

--

- The within-cluster variance is the total variance across all attributes.

---

# Ward's algorithm

- At each step we must merge two clusters to form a single cluster.

--

- Suppose we pick clusters `\(\mathcal{A}\)` and `\(\mathcal{B}\)` to form a new cluster `\(\mathcal{C}\)`.

--

- Ward's algorithm chooses `\(\mathcal{A}\)` and `\(\mathcal{B}\)` so that `\(V_{W}(\mathcal{C})\)` is as small as possible.

---

class: middle, center, inverse

# Wrap-up

---

# Conclusions

- We have covered hierarchical clustering.
- In BUSS6002 you will also cover *k-means clustering*.
- An advantage of hierarchical clustering is visualisation via the dendrogram.
- However, the idea of understanding when observations are similar is useful in many other areas of business analytics.

---

class: middle, center, inverse

# Questions
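---

# Appendix: checking the small example

A sketch (not part of the original slides) that verifies the one-variable worked example numerically, and computes the within-cluster variance that Ward's method considers when the two clusters are merged.

```python
import numpy as np
from scipy.spatial.distance import cdist

# The clusters from the slides: A = {0, 2}, B = {3, 5}
A = np.array([[0.0], [2.0]])
B = np.array([[3.0], [5.0]])

# Average linkage: mean of all pairwise distances
print(cdist(A, B).mean())                                # 3.0

# Centroid method: distance between the cluster means
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))   # 3.0

# Within-cluster variance of the merged cluster C = A and B combined
C = np.vstack([A, B])
S = ((C - C.mean(axis=0)) ** 2).sum()
print(S / (len(C) - 1))
```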