data(mtcars)
Make/Model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
After clustering the Wholesale customers data into three groups we get the following cluster means.
Cluster | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
---|---|---|---|---|---|---|
1 | 35941 | 6044 | 6289 | 6714 | 1040 | 3049 |
2 | 8253 | 3825 | 5280 | 2573 | 1773 | 1137 |
3 | 8000 | 18511 | 27574 | 1997 | 12407 | 2252 |
One way of defining the distance between clusters A and B is
$$D(A,B)=\min_{i,j} D(a_i,b_j)$$
This is called single linkage or nearest neighbour.
Another way of defining the distance between A and B is
$$D(A,B)=\max_{i,j} D(a_i,b_j)$$
This is called complete linkage or furthest neighbour.
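In R both linkages are selected through the method argument of the hclust function (introduced below). A minimal sketch using the mtcars data, where hc_single and hc_complete are just placeholder names:

data(mtcars)
# Single linkage (nearest neighbour)
mtcars %>% scale %>% dist %>% hclust(method = "single") -> hc_single
# Complete linkage (furthest neighbour)
mtcars %>% scale %>% dist %>% hclust(method = "complete") -> hc_complete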
Hierarchical clustering in R involves three steps:

- Standardise the variables (scale)
- Compute the distance matrix (dist)
- Carry out the clustering (hclust)
Now cluster the mtcars dataset. Use Euclidean distance and complete linkage. Store the result of hclust in a variable called CarsCluster.

data(mtcars)
mtcars %>% scale %>% dist %>% hclust(method = "complete") -> CarsCluster
plot(CarsCluster, cex = 0.5)
CarsCluster %>% plot(cex = 0.5)
CarsCluster %>% rect.hclust(k = 2)
For an interactive tool try:
identify(CarsCluster)
Press the escape key when you are finished.
For a given number of clusters we can create a new variable indicating cluster membership via the cutree
function.
mem <- cutree(CarsCluster, 2)
Car | Cluster |
---|---|
Mazda RX4 | 1 |
Mazda RX4 Wag | 1 |
Datsun 710 | 2 |
Hornet 4 Drive | 2 |
Hornet Sportabout | 1 |
Valiant | 2 |
Duster 360 | 1 |
Merc 240D | 2 |
Merc 230 | 2 |
Merc 280 | 2 |
Merc 280C | 2 |
Merc 450SE | 1 |
Merc 450SL | 1 |
Merc 450SLC | 1 |
Cadillac Fleetwood | 1 |
Lincoln Continental | 1 |
Chrysler Imperial | 1 |
Fiat 128 | 2 |
Honda Civic | 2 |
Toyota Corolla | 2 |
Toyota Corona | 2 |
Dodge Challenger | 1 |
AMC Javelin | 1 |
Camaro Z28 | 1 |
Pontiac Firebird | 1 |
Fiat X1-9 | 2 |
Porsche 914-2 | 2 |
Lotus Europa | 2 |
Ford Pantera L | 1 |
Ferrari Dino | 1 |
Maserati Bora | 1 |
Volvo 142E | 2 |
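To summarise the membership vector rather than listing every car, one option is table; aggregate then gives per-cluster means on the original variables (a small sketch using base R):

table(mem)                                               # number of cars in each cluster
aggregate(mtcars, by = list(cluster = mem), FUN = mean)  # cluster means on the raw scale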
The distance between two clusters can also be defined using all the pairwise distances between the elements of each cluster:

$$D(A,B)=\frac{1}{|A||B|}\sum_{i=1}^{|A|}\sum_{j=1}^{|B|} D(a_i,b_j)$$

Here $|A|$ is the number of observations in cluster $A$ and $|B|$ is the number of observations in cluster $B$. This is called average linkage.
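In hclust this corresponds to method = "average"; a brief sketch in the same style as before, with hc_average as a placeholder name:

mtcars %>% scale %>% dist %>% hclust(method = "average") -> hc_average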
In hclust, method = "ward.D2" matches the original Ward paper, while method = "ward.D" is based on a mistake but can still work quite well. Ward's method works with the within-cluster variance

$$V_w(A)=\frac{1}{|A|-1}S(A)$$

where

$$S(A)=\sum_{a_i\in A}\left[(a_i-\bar{a})'(a_i-\bar{a})\right]$$
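A short sketch of both Ward variants (hc_ward2 and hc_ward are placeholder names):

# Ward's method as in the original paper
mtcars %>% scale %>% dist %>% hclust(method = "ward.D2") -> hc_ward2
# The historical variant based on a mistake
mtcars %>% scale %>% dist %>% hclust(method = "ward.D") -> hc_ward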
K-means clustering is carried out with the kmeans function.
To obtain a three-cluster solution:

WholesaleCluster <- kmeans(Wholesale, 3)
If the data are in a data.frame you may need to select the numeric variables.
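For example, assuming the dplyr package is loaded and that the data frame contains some non-numeric columns (a sketch, not part of the original code):

library(dplyr)
# Keep only the numeric columns before clustering
Wholesale %>% select(where(is.numeric)) -> WholesaleNum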
The kmeans output is a list with several components, including:

- cluster: the cluster membership of each observation
- centers: the matrix of cluster means
- size: the number of observations in each cluster
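These components are extracted with the $ operator, for example:

WholesaleCluster$cluster  # cluster membership of each observation
WholesaleCluster$centers  # matrix of cluster means
WholesaleCluster$size     # number of observations in each cluster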
Cluster | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
---|---|---|---|---|---|---|
1 | 8000.04 | 18511.420 | 27573.900 | 1996.680 | 12407.360 | 2252.020 |
2 | 35941.40 | 6044.450 | 6288.617 | 6713.967 | 1039.667 | 3049.467 |
3 | 8253.47 | 3824.603 | 5280.455 | 2572.661 | 1773.058 | 1137.497 |
Since the result is sensitive to the starting values, we can run the algorithm from many different random starts using the nstart option:

WholesaleCluster <- kmeans(Wholesale, 3, nstart = 25)
Cluster | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
---|---|---|---|---|---|---|
1 | 35941.40 | 6044.450 | 6288.617 | 6713.967 | 1039.667 | 3049.467 |
2 | 8000.04 | 18511.420 | 27573.900 | 1996.680 | 12407.360 | 2252.020 |
3 | 8253.47 | 3824.603 | 5280.455 | 2572.661 | 1773.058 | 1137.497 |
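One way to see the effect of nstart is to compare the total within-cluster sum of squares, which kmeans reports as tot.withinss; a lower value indicates a better local optimum. A small sketch, with set.seed added only for reproducibility:

set.seed(1)
km_single <- kmeans(Wholesale, 3)              # one random start
km_multi  <- kmeans(Wholesale, 3, nstart = 25) # best of 25 random starts
km_single$tot.withinss
km_multi$tot.withinss   # typically no larger than the single-start value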
Together we will use Ward's method to do hierarchical clustering on the Wholesale data and get the cluster membership from the two- and three-cluster solutions. Then you can try the same for k-means.
Wholesale %>% dist %>% hclust(method = "ward.D2") -> hiercl
cl2 <- cutree(hiercl, 2)
cl3 <- cutree(hiercl, 3)
table(cl2, cl3)
cl2 \ cl3 | 1 | 2 | 3 |
---|---|---|---|
1 | 261 | 0 | 45 |
2 | 0 | 134 | 0 |
km2 <- kmeans(Wholesale, 2)
kmcl2 <- km2$cluster
km3 <- kmeans(Wholesale, 3)
kmcl3 <- km3$cluster
table(kmcl2, kmcl3)
kmcl2 \ kmcl3 | 1 | 2 | 3 |
---|---|---|---|
1 | 0 | 59 | 6 |
2 | 330 | 1 | 44 |
To compare two cluster solutions numerically you can use the adjustedRandIndex function in the mclust package.
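For example, to measure the agreement between the hierarchical and k-means three-cluster solutions from above (assuming mclust is installed):

library(mclust)
adjustedRandIndex(cl3, kmcl3)  # 1 = identical partitions, near 0 = chance agreement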