class: center, middle, inverse, title-slide

# Multidimensional Scaling
## High Dimensional Data Analysis
### Anastasios Panagiotelis & Ruben Loaiza-Maya
### Lecture 5

---

class: inverse, center, middle

# Motivation

---

# Motivation

- Previously we looked at the concept of *distance* between observations.<!--D-->

--

- We looked at our usual understanding of distance, known as *Euclidean* distance.<!--D-->

--

- We also looked at higher-dimensional versions of Euclidean distance.<!--D-->

--

- Other distance metrics, including *Jaccard* distance, can be used for categorical data.

---

# Can we see distance?

- Suppose we have `\(n\)` observations and the distance between each possible pair of observations.<!--D-->

--

- A scatterplot shows whether observations are close together or far apart.<!--D-->

--

- This works nicely when there are 2 variables.<!--D-->

---

# Higher-dimensional plots

- Suppose we have `\(p\)` variables where `\(p\)` is large.

--

- Consider `\(p\)`-dimensional Euclidean distances.<!--D-->

--

- Can we represent these using just `\(2\)` dimensions?<!--D-->

--

- Unfortunately the answer is no...<!--D-->

--

- ... but we can get a good approximation.

---

# Multidimensional Scaling

- Multidimensional scaling (MDS) finds a low (usually 2) dimensional representation.<!--D-->

--

- The pairwise 2D Euclidean distances in this representation should be as *close* as possible to the original distances.<!--D-->

--

- The meaning of *close* can vary, since there are different ways to do MDS.<!--D-->

--

- However, MDS always begins with a matrix of distances and ends with a low dimensional representation that can be plotted.

---

# An optical illusion with Beyonce

![Beyonce and the Eiffel Tower](beyonce.jpg)

---

# Why does the illusion work?

- The photo is a 2D representation of a 3D reality.<!--D-->

--

- In reality the distance between Beyonce's hand and the Eiffel Tower is large.<!--D-->

--

- In the 2D photo, this distance is small.<!--D-->

--

- The photo is therefore a misleading representation for understanding the distance between Beyonce's hand and the Eiffel Tower.<!--D-->

--

- A much more informative representation could be found by *rotation*.

---

# Why do we care?

- An important issue in business is to profile the market. For example:<!--D-->

--

  + Which products do customers perceive to be similar to one another?<!--D-->

--

  + Who is my closest competitor?<!--D-->

--

  + Are there ‘gaps’ in the market, where a new product can be introduced?<!--D-->

--

- Multidimensional Scaling can help us to produce a simple visualisation that addresses these questions.

---

# Beer Example

<img src="MDS_files/figure-html/beer-1.png" style="display: block; margin: auto;" />

---

# Beer Example

- The plot on the previous slide is an MDS solution for the beer dataset.<!--D-->

--

- The data are 5-dimensional, so we cannot use a scatterplot.<!--D-->

--

- MDS shows that Olympia Gold Light and Pabst Extra Light are similar (both light beers).<!--D-->

--

- It also suggests that St Pauli Girl has few close competitors.<!--D-->

--

- This may also reflect that the attributes of St Pauli Girl are not desired by customers.<!--D-->

--

- How did we get the plot?
---

# Beer Data

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> beer </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> rating </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> origin </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> avail </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> price </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> cost </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> calories </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> sodium </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> alcohol </th> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> light </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Olympia Gold Light </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 2.75 </td> <td style="text-align:right;"> 0.46 </td> <td style="text-align:right;"> 72 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:right;"> 2.9 </td> <td style="text-align:left;"> LIGHT </td> </tr> <tr> <td style="text-align:left;"> Pabst Extra Light </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> National </td> <td style="text-align:right;"> 2.29 </td> <td style="text-align:right;"> 0.38 </td> <td style="text-align:right;"> 68 </td> <td style="text-align:right;"> 15 </td> <td style="text-align:right;"> 2.3 </td> <td style="text-align:left;"> LIGHT </td> </tr> <tr> <td style="text-align:left;"> Schlitz Light </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> National </td> <td style="text-align:right;"> 2.79 </td> <td style="text-align:right;"> 0.47 </td> <td style="text-align:right;"> 97 </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 4.2 </td> <td style="text-align:left;"> LIGHT </td> </tr> <tr> <td style="text-align:left;"> Blatz </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 1.79 </td> <td style="text-align:right;"> 0.30 </td> <td style="text-align:right;"> 144 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 4.6 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> <tr> <td style="text-align:left;"> Hamms </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 2.59 </td> <td style="text-align:right;"> 0.43 </td> <td style="text-align:right;"> 136 </td> <td style="text-align:right;"> 19 </td> <td style="text-align:right;"> 4.4 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> <tr> <td style="text-align:left;"> Heilmans Old Style </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 2.59 </td> <td style="text-align:right;"> 0.43 </td> <td style="text-align:right;"> 144 </td> <td style="text-align:right;"> 24 </td> <td style="text-align:right;"> 4.9 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> <tr> <td style="text-align:left;"> Rolling Rock </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 2.15 </td> <td style="text-align:right;"> 0.36 </td> <td style="text-align:right;"> 144 </td> <td style="text-align:right;"> 8 </td> <td style="text-align:right;"> 4.7 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> <tr> <td style="text-align:left;"> Scotch Buy (Safeway) </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 1.59 </td> <td style="text-align:right;"> 0.27 </td> <td style="text-align:right;"> 145 </td> <td style="text-align:right;"> 18 </td> <td style="text-align:right;"> 4.5 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> <tr> <td style="text-align:left;"> St Pauli Girl </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> Germany </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 4.59 </td> <td style="text-align:right;"> 0.77 </td> <td style="text-align:right;"> 144 </td> <td style="text-align:right;"> 21 </td> <td style="text-align:right;"> 4.7 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> <tr> <td style="text-align:left;"> Tuborg </td> <td style="text-align:left;"> Fair </td> <td style="text-align:left;"> USA </td> <td style="text-align:left;"> Regional </td> <td style="text-align:right;"> 2.59 </td> <td style="text-align:right;"> 0.43 </td> <td style="text-align:right;"> 155 </td> <td style="text-align:right;"> 13 </td> <td style="text-align:right;"> 5.0 </td> <td style="text-align:left;"> NONLIGHT </td> </tr> </tbody> </table></div>

---

# Details

- To keep the example simple, only the beers rated *fair* are used.<!--D-->

--

- In general, all the beers can be used.<!--D-->

--

- Also to keep things simple, we only consider the metric variables, so that we can use Euclidean distance.<!--D-->

--

- In general, we can use distance metrics that work for categorical data.
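---

# Mixed data in R

- As a sketch of how this could be done (not the approach used in the rest of these slides), the `daisy` function in the `cluster` package computes Gower dissimilarities for a mix of metric and categorical variables; the object name `delta_all` is just an illustration.


```r
library(cluster)                        #For the daisy function
Beer%>%
  select(-beer)%>%                      #Drop the name column
  mutate(across(where(is.character),
                as.factor))%>%          #daisy treats factors as nominal
  daisy(metric='gower')->delta_all      #Gower dissimilarities
```

- The result inherits from `dist`, so it can be passed to the same MDS functions as a Euclidean distance matrix.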
---

# Metric Variables

- After standardising, Euclidean distances are formed between every possible pair of beers.<!--D-->

--

- For example, the distance between Blatz and Tuborg is given by<!--D-->

--

`$$\delta\left(\mbox{Blatz},\mbox{Tbrg}\right)=\sqrt{\sum\limits_{h=1}^5(\mbox{Blatz}_h-\mbox{Tbrg}_h)^2}$$`

Both the notation `\(\delta_{ij}\)` and `\(\delta(i,j)\)` will be used interchangeably.

---

# Doing it in R

To obtain the distance matrix in R:


```r
filter(Beer,rating=='Fair')%>% #Only fair beers
  select_if(is.numeric)%>%     #Only metric data
  scale%>%                     #Standardise
  dist->delta                  #Distance matrix
filter(Beer,rating=='Fair')%>% #Only fair beers
  pull(beer)%>%                #Get beer names
  abbreviate(6)->              #Abbreviate names
  attributes(delta)$Labels     #Assign labels to delta
```

---

# MDS in R

We can do what is known as **classical** MDS in R using the `cmdscale` function.


```r
mdsout<-cmdscale(delta)
```

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:200px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td style="text-align:left;"> OlymGL </td> <td style="text-align:right;"> -1.9758212 </td> <td style="text-align:right;"> -1.6276821 </td> </tr> <tr> <td style="text-align:left;"> PbstEL </td> <td style="text-align:right;"> -2.1860282 </td> <td style="text-align:right;"> -1.1600914 </td> </tr> <tr> <td style="text-align:left;"> SchltL </td> <td style="text-align:right;"> -0.7420968 </td> <td style="text-align:right;"> -0.7994497 </td> </tr> <tr> <td style="text-align:left;"> Blatz </td> <td style="text-align:right;"> -0.3386684 </td> <td style="text-align:right;"> 1.4929936 </td> </tr> <tr> <td style="text-align:left;"> Hamms </td> <td style="text-align:right;"> 0.6053483 </td> <td style="text-align:right;"> 0.2720245 </td> </tr> <tr> <td style="text-align:left;"> HlmnOS </td> <td style="text-align:right;"> 1.3641181 </td> <td style="text-align:right;"> 0.6556403 </td> </tr> <tr> <td style="text-align:left;"> RllngR </td> <td style="text-align:right;"> -0.2932490 </td> <td style="text-align:right;"> 0.9661501 </td> </tr> <tr> <td style="text-align:left;"> SB(Sf) </td> <td style="text-align:right;"> -0.2067650 </td> <td style="text-align:right;"> 1.7951323 </td> </tr> <tr> <td style="text-align:left;"> StPlGr </td> <td style="text-align:right;"> 2.9648884 </td> <td style="text-align:right;"> -2.2976842 </td> </tr> <tr> <td style="text-align:left;"> Tuborg </td> <td style="text-align:right;"> 0.8082737 </td> <td style="text-align:right;"> 0.7029665 </td> </tr> </tbody> </table></div>

---

# Two new variables

- We have just created two new variables for visualising the distances.<!--D-->

--

- The distances that we visualise will be 2-dimensional distances. For example<!--D-->

--

`$$\begin{align}d&(\mbox{Blatz},\mbox{Tbrg})=\\&\sqrt{(-0.339-0.808)^2+(1.493-0.703)^2}\end{align}$$`

---

# Not exact

- In this example `\(d(\mbox{Blatz},\mbox{Tuborg})=1.3927\)` while `\(\delta(\mbox{Blatz},\mbox{Tuborg})=1.4762\)`. Notice that<!--D-->

--

$$ d(\mbox{Blatz},\mbox{Tuborg})\neq \delta(\mbox{Blatz},\mbox{Tuborg}) $$

- But they are close.
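---

# Checking in R

- As a quick sanity check (a sketch, assuming `delta` and `mdsout` from the earlier chunks are still in the workspace; `d2` is just an illustrative name), we can compare a 2D distance from the solution with the original distance.


```r
dist(mdsout)->d2                   #2D distances from the MDS solution
as.matrix(d2)['Blatz','Tuborg']    #Approximately 1.3927
as.matrix(delta)['Blatz','Tuborg'] #Approximately 1.4762
```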
---

# Getting the plot


```r
mdsout%>%
  as_tibble()%>%
  ggplot(aes(x=V1,y=V2))+geom_point()
```

<img src="MDS_files/figure-html/getplot-1.png" style="display: block; margin: auto;" />

---

# Getting the plot with names


```r
mdsout%>%
  as_tibble(rownames='BeerName')%>%
  ggplot(aes(x=V1,y=V2,label=BeerName))+geom_text()
```

<img src="MDS_files/figure-html/getplotn-1.png" style="display: block; margin: auto;" />

---

# The math behind classical MDS

- In *classical* MDS the objective is to minimise strain<!--D-->

--

`$$\mbox{Strain}=\sum\limits_{i=1}^{n-1}\sum\limits_{j>i}(\delta^2_{ij}-d^2_{ij})$$`

- Note that the `\(\delta_{ij}\)` are high dimensional distances that come from the true data.

--

- The `\(d_{ij}\)` are low dimensional distances that come from the solution.

---

# When can this be solved?

- The above problem has a tractable solution when Euclidean distance is used.

--

- This solution depends on an eigenvalue decomposition.

--

- This solution *rotates* the points until we get a 2D view that represents the true distances as accurately as possible.

---

# Summary

- When Euclidean distance is used, the solution provided by classical MDS:

--

  - Minimises the strain.

--

  - Results in eigenvalues that are all positive.

--

- Can we use classical MDS when distances are non-Euclidean?

---

# An example: Road distances

- Suppose that we have the road distances between different cities in Australia.<!--D-->

--

- The road distances are non-Euclidean since roads can be quite wiggly.<!--D-->

--

- We want to create a 2-dimensional map with the locations of the cities using only these road distances.

--

- Classical MDS can give an approximation that is quite close to a real map.

---

# Road Distances

<div style="border: 1px solid #ddd; padding: 0px; overflow-y: scroll; height:500px; "><table class="table table-striped table-hover table-condensed" style="margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;position: sticky; top:0; background-color: #FFFFFF;"> </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Cairns </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Brisbane </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Sydney </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Melbourne </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Adelaide </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Perth </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Darwin </th> <th style="text-align:right;position: sticky; top:0; background-color: #FFFFFF;"> Alice Springs </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Cairns </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1717 </td> <td style="text-align:right;"> 2546 </td> <td style="text-align:right;"> 3054 </td> <td style="text-align:right;"> 3143 </td> <td style="text-align:right;"> 5954 </td> <td style="text-align:right;"> 2727 </td> <td style="text-align:right;"> 2324 </td> </tr> <tr> <td style="text-align:left;"> Brisbane </td> <td style="text-align:right;"> 1717 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 996 </td> <td style="text-align:right;"> 1674 </td> <td style="text-align:right;"> 2063 </td> <td style="text-align:right;"> 4348 </td> <td style="text-align:right;"> 3415 </td> <td style="text-align:right;"> 3012 </td> </tr> <tr> <td style="text-align:left;"> Sydney </td> <td style="text-align:right;"> 2546 </td> <td style="text-align:right;"> 996 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 868 </td> <td style="text-align:right;"> 1420 </td> <td style="text-align:right;"> 4144 </td> <td style="text-align:right;"> 4000 </td> <td style="text-align:right;"> 2644 </td> </tr> <tr> <td style="text-align:left;"> Melbourne </td> <td style="text-align:right;"> 3054 </td> <td style="text-align:right;"> 1674 </td> <td style="text-align:right;"> 868 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 728 </td> <td style="text-align:right;"> 3452 </td> <td style="text-align:right;"> 3781 </td> <td style="text-align:right;"> 2270 </td> </tr> <tr> <td style="text-align:left;"> Adelaide </td> <td style="text-align:right;"> 3143 </td> <td style="text-align:right;"> 2063 </td> <td style="text-align:right;"> 1420 </td> <td style="text-align:right;"> 728 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 2724 </td> <td style="text-align:right;"> 3053 </td> <td style="text-align:right;"> 1542 </td> </tr> <tr> <td style="text-align:left;"> Perth </td> <td style="text-align:right;"> 5954 </td> <td style="text-align:right;"> 4348 </td> <td style="text-align:right;"> 4144 </td> <td style="text-align:right;"> 3452 </td> <td style="text-align:right;"> 2724 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 4045 </td> <td style="text-align:right;"> 3630 </td> </tr> <tr> <td style="text-align:left;"> Darwin </td> <td style="text-align:right;"> 2727 </td> <td style="text-align:right;"> 3415 </td> <td style="text-align:right;"> 4000 </td> <td style="text-align:right;"> 3781 </td> <td style="text-align:right;"> 3053 </td> <td style="text-align:right;"> 4045 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1511 </td> </tr> <tr> <td style="text-align:left;"> Alice Springs </td> <td style="text-align:right;"> 2324 </td> <td style="text-align:right;"> 3012 </td> <td style="text-align:right;"> 2644 </td> <td style="text-align:right;"> 2270 </td> <td style="text-align:right;"> 1542 </td> <td style="text-align:right;"> 3630 </td> <td style="text-align:right;"> 1511 </td> <td style="text-align:right;"> 0 </td> </tr> </tbody> </table></div>

---

# Australia

<img src="MDS_files/figure-html/ozmap-1.png" style="display: block; margin: auto;" />

---

# MDS Solution

<img src="MDS_files/figure-html/mdsoz-1.png" style="display: block; margin: auto;" />

---

# Rotate

<img src="MDS_files/figure-html/mdsoz2-1.png" style="display: block; margin: auto;" />

---

# Back with Map

<img src="MDS_files/figure-html/mdssolmap-1.png" style="display: block; margin: auto;" />

---

# Rotating

- Once a solution is available, we rotate the points within 2 dimensions.<!--D-->

--

- The 2D rotation does not change any of the distances.<!--D-->

--

- It can help us to interpret the axes.<!--D-->

--

- In the previous example the x-axis represents the East-West direction and the y-axis represents North-South.

---

class: inverse, center, middle

# Evaluating MDS

---

# How good is this representation?

- In theory, as long as the original distances are Euclidean, strain is minimised.

--

- What if the optimal solution is still bad?

--

- We use two goodness of fit measures.<!--D-->

--

- Think of these in a similar fashion to R-squared in regression modelling.

---

# Goodness of Fit Measures
- These values depend on the eigenvalues:

`$$\mbox{GF}_1=\frac{\sum\limits_{i=1}^2 |\lambda_i|}{\sum\limits^n_{i=1}|\lambda_i|}\,,\quad \mbox{GF}_2=\frac{\sum\limits_{i=1}^2 \max(0,\lambda_i)}{\sum\limits^n_{i=1}\max(0,\lambda_i)}$$`

- For Euclidean distances `\(\delta_{ij}\)`, the eigenvalues are always positive and `\(\mbox{GF}_1=\mbox{GF}_2\)`.

---

# Beer Example

- In R, obtain the GoF measures using the option `eig=TRUE` in the `cmdscale` function.<!--D-->

--

- For the beer data:


```r
mdsout<-cmdscale(delta,eig=TRUE)
str(mdsout$GOF)
```

```
## num [1:2] 0.854 0.854
```

---

# GoF Measures

- You may notice that the GoF measures are the same.

--

- This is always the case when Euclidean distance is used.

--

- This arises since all eigenvalues are positive when the distance matrix is based on Euclidean distance.

---

# Non-Euclidean distances

- In theory non-Euclidean distances can lead to negative eigenvalues. In this case:<!--D-->

--

  + Classical MDS may not minimise Strain.<!--D-->

--

  + It minimises a slightly different function of the distances.

--

  + The two fit measures will differ.<!--D-->

--

- Overall, we can use classical MDS for non-Euclidean distances but must be more careful.

---

# Australia data


```r
cmdscale(doz,eig=TRUE)->dozout
str(dozout$eig)
```

```
## num [1:8] 1.97e+07 1.25e+07 2.62e+06 5.96e+04 -3.26e-09 ...
```

```r
str(dozout$eig[6:8])
```

```
## num [1:3] -311786 -1083294 -2179888
```

```r
str(dozout$GOF)
```

```
## num [1:2] 0.837 0.923
```

---

# Evaluating the Result

- There are negative eigenvalues.<!--D-->

--

  + This occurs since road distances are not Euclidean.<!--D-->

--

  + This also implies that classical MDS does not minimise strain.<!--D-->

--

- Both goodness of fit measures are quite high.<!--D-->

--

  + The solution is an accurate representation.

---

# Another example: Cheese

The following example comes from ‘Multidimensional Scaling of Sorting Data Applied to Cheese Perception’, *Food Quality and Preference*, 6, pp. 91-98. The purpose of this study was to visualise the difference between types of cheese.

---

# Another example: Cheese

- The motivation is to investigate the similarities and differences between types of cheese.<!--D-->

--

- In principle one could measure attributes of the cheese.<!--D-->

--

- However the purpose of this study was to ask customers about their perceptions.<!--D-->

--

- How do we ask customers about distances?<!--D-->

--

- Could you walk out on to the street and ask someone about the Euclidean distance between Brie and Camembert?

---

# Constructing the Survey

- Customers can be asked:<!--D-->

--

- On a scale of 1 to 10, with 1 being the most similar and 10 being the most different, how similar are the following cheeses:<!--D-->

--

  + Brie and Camembert<!--D-->

--

  + Brie and Roquefort<!--D-->

--

  + Camembert and Roquefort

--

- The dissimilarity scores can be averaged over all customers and used in an MDS.<!--D-->

--

- This is not a good method when there is a large number of products.

---

# A more feasible approach

- In the study there are 16 cheeses, and therefore 120 possible pairwise comparisons.<!--D-->

--

- It is not practical to ask survey participants to make 120 comparisons!<!--D-->

--

- Instead of being asked to make so many comparisons, customers were asked to put similar cheeses into groups.<!--D-->

--

- The proportion of customers who put two cheeses in the same group is a similarity score.<!--D-->

--

- The proportion of customers who put two cheeses in different groups is a dissimilarity score.
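---

# Sorting data in R

- The next slides work through a small example by hand; the sketch below shows the same computation in R (the `sorts` matrix of group labels is a hypothetical illustration).


```r
sorts<-rbind(A=c(1,1,2,2),   #One row per customer,
             B=c(1,2,3,3),   #one column per cheese,
             C=c(1,2,3,4),   #each entry is the group label
             D=c(1,1,1,1))   #assigned by that customer.
colnames(sorts)<-c('Brie','Camembert','Roquefort','BlueVein')
#Dissimilarity: proportion of customers splitting a pair across groups
diss<-outer(1:4,1:4,
            Vectorize(function(i,j) mean(sorts[,i]!=sorts[,j])))
dimnames(diss)<-list(colnames(sorts),colnames(sorts))
as.dist(diss)
```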
---

# Consider four customers

- Suppose there are four customers sorting cheeses:<!--D-->

--

  + Customer A: Brie and Camembert together, Roquefort and Blue Vein together<!--D-->

--

  + Customer B: Roquefort and Blue Vein together, all others separate<!--D-->

--

  + Customer C: All cheeses in their own category<!--D-->

--

  + Customer D: All cheeses in one category

---

# Comparisons

- Customers A and D have Brie and Camembert in the same group; customers B and C have them in different groups.<!--D-->

--

  + The distance between Brie and Camembert is 0.5.<!--D-->

--

- Customers A, B and D have Roquefort and Blue Vein in the same group; customer C has them in different groups.<!--D-->

--

  + The distance between Roquefort and Blue Vein is 0.25.

---

# MDS

- The study on cheese did not use classical MDS but something called *Kruskal's algorithm*.

- There are many alternatives to classical MDS.

- We now briefly cover some of the ideas behind them.

---

class: inverse, center, middle

# Beyond Classical MDS

---

# Beyond Classical MDS

- Classical MDS is designed to minimise Strain.<!--D-->

--

- An alternative objective function called Stress can be minimised instead<!--D-->

--

`$$\mbox{Stress}=\sum\limits_{i=1}^{n-1}\sum\limits_{j>i}\frac{(\delta_{ij}-d_{ij})^2}{\delta_{ij}}$$`

--

- The difference between `\(\delta_{ij}\)` and `\(d_{ij}\)` acts like an error.

--

- The `\(\delta_{ij}\)` in the denominator acts as a weight.

---

# Weighting

- For large `\(\delta_{ij}\)`, observations are far apart in the original space.

--

- For these pairs errors are more easily tolerated.

--

- For small `\(\delta_{ij}\)`, observations are close in the original space.

--

- For these pairs errors are not tolerated.

--

- The most accuracy is achieved for nearby points.

--

- The local structure is preserved.

---

# Sammon mapping

- The Sammon mapping is solved by numerical optimisation (an R sketch follows the discussion below).

--

- It is different from the classical solution.

--

- It is not based on an eigenvalue decomposition.

--

- It is not based on rotation.

--

- It is a non-linear mapping.

---

# Example

- Consider the case where points are in 2D space and the aim is to summarise them in 1D space (along a line).

- The specific problem of doing multidimensional scaling where the lower dimension is 1 is called *seriation*.

- It provides a ranking of the observations.

--

- In marketing it can be used to elicit preferences.

---

# Original Data

<img src="MDS_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Original Data

<img src="MDS_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Rotate (Classical Solution)

<img src="MDS_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

# Keep 1 Dimension

<img src="MDS_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

# Rug plot (classical solution)

<img src="MDS_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />

---

# Sammon Mapping

<img src="MDS_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" />

---

# Discussion

- Classical MDS cannot account for non-linearity.

- The dark blue and yellow points are represented as close to one another.

- Sammon mapping does account for non-linearity.

- There, the blue and yellow points are represented as far apart.

- Although they are not so far apart in the original space, these observations are downweighted relative to the local structure.
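---

# Sammon mapping in R

- For the beer data, a Sammon mapping could be obtained as sketched below (assuming `delta` from the earlier chunks; the object name `sBeer` is just an illustration).


```r
library(MASS)         #Provides the sammon function
sammon(delta)->sBeer  #Minimise Stress by numerical optimisation
sBeer$points%>%       #Low dimensional coordinates
  as_tibble()%>%
  ggplot(aes(x=V1,y=V2))+geom_point()
```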
---

# Kruskal's algorithm

- Kruskal's algorithm minimises a slightly different criterion.

--

- This is still often called *stress*, which is admittedly confusing.

--

- Kruskal's algorithm is implemented in R using the `isoMDS` function from the `MASS` package.

---

# Monotone transformations

- Kruskal's algorithm is invariant to monotone transformations of the distances.

--

- By *monotone transformation* we mean any function of the distance that is either always increasing or always decreasing.

--

- The exponential function is monotone.

--

- The sine function is not monotone.

--

- By *invariant* we mean that the solution provided by Kruskal's algorithm does not change if we apply such a transformation to the input distances.

---

# Example


```r
library(MASS)
isoMDS(delta)->kBeer
```

```
## initial value 9.127089
## iter 5 value 5.688460
## final value 5.611143
## converged
```

---

# Make plot


```r
kBeer$points%>%
  as_tibble()%>%
  ggplot(aes(x=V1,y=V2))+
  geom_point(size=10)
```

---

# Make plot

<img src="MDS_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" />

---

# Squared distances


```r
isoMDS(delta^2)->kBeer2
```

```
## initial value 11.274285
## iter 5 value 6.447929
## iter 10 value 5.697285
## final value 5.603035
## converged
```

---

# Solution

<img src="MDS_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# Comparison

- Squaring the distances provides the same solution with two caveats:

--

  - The stress is slightly different. Numerical optimisation can vary a little depending on starting values.

--

  - The points in one plot are slightly rotated compared to the other.

--

- Why is the invariance to monotone transformations important?

---

# Non-metric MDS

- In some cases, the distances themselves are not metric but ordinal.<!--D-->

--

- Suppose we only know<!--D-->

--

`$$\delta_{\mbox{Bri.},\mbox{Cam.}}< \delta_{\mbox{Roq.},\mbox{Cam.}}< \delta_{\mbox{Roq.},\mbox{Bri.}}$$`

--

- Brie and Roquefort are *more different* than Brie and Camembert.<!--D-->

--

- We do not know *how big* the distance between Brie and Roquefort is compared to the distance between Brie and Camembert.

---

# Non-metric MDS

- In this case we minimise Stress subject to constraints, e.g.<!--D-->

--

`$$\hat{\delta}_{\mbox{Bri.},\mbox{Cam.}}< \hat{\delta}_{\mbox{Roq.},\mbox{Cam.}}< \hat{\delta}_{\mbox{Roq.},\mbox{Bri.}}$$`

---

# Non-metric MDS

- Taking the ranks is an example of a monotone transformation.

--

- Therefore the solution of `isoMDS` only requires the ranks of the distances and not the distances themselves.

--

- This is a very useful algorithm for marketing, since survey participants cannot easily and reliably assign numbers to the differences between products.<!--D-->

---

# Modern MDS

- Methods for finding a low-dimensional representation of high-dimensional data continue to be used today.

--

- These mostly go by the name of **manifold learning** methods.

--

- They are not only used for visualisation.

--

- The low-dimensional co-ordinates can also be used as features in classification and regression.

---

# Examples

- Locally Linear Embedding (LLE)

- Isomap

- Laplacian Eigenmap

- t-SNE

- Kohonen Map

- ...

--

- ... and others.

---

# Properties

- For most of the modern methods two characteristics are common.

--

- The idea that local structure should be preserved. The first step of many algorithms is to find the nearest neighbours of each point.

--

- In many algorithms an eigenvalue decomposition forms part of the solution, as is the case in classical MDS.
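---

# A taster: t-SNE in R

- As a sketch of one modern method, t-SNE can be applied to the beer distances, assuming the `Rtsne` package is installed; `tsneBeer` is just an illustrative name, and `perplexity` must be small since there are only 10 beers.


```r
library(Rtsne)        #An implementation of t-SNE
set.seed(1)           #t-SNE solutions are stochastic
Rtsne(as.matrix(delta),is_distance=TRUE,
      perplexity=2)->tsneBeer
tsneBeer$Y%>%         #Low dimensional coordinates
  as_tibble()%>%
  ggplot(aes(x=V1,y=V2))+geom_point()
```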