class: center, middle, inverse, title-slide

# Distance
## High Dimensional Data Analysis
### Anastasios Panagiotelis & Ruben Loaiza-Maya
### Lecture 3

---
class: inverse, center, middle

# Why distance?

---

# Why distance?

- Many problems involve thinking about how *similar* or dissimilar two observations are. For example:<!--D-->
--

    + May use the same marketing strategy for *similar* demographic groups.
    + May lend money to applicants who are *similar* to those who pay their debts back.<!--D-->
--

- Arguably the most important concept in data analysis is *distance*.

---

# Simple example

- Consider 3 individuals:<!--D-->
--

    + Mr Orange: 37 years of age, earns $75k a year
    + Mr Red: 31 years of age, earns $67k a year
    + Mr Blue: 30 years of age, earns $68k a year<!--D-->
--

- Which two are the most similar?

---

# On a scatterplot

<img src="Distance_files/figure-html/scatter-1.png" style="display: block; margin: auto;" />

---

# Distance as a number

- It is easy to think about three individuals, but what if there are thousands of individuals?<!--D-->
--

    + In this case it will be useful to attach some number to the distance between pairs of individuals.<!--D-->
--

    + We will do it with a simple application of Pythagoras' theorem.

---

# Finding the Distance

<img src="Distance_files/figure-html/pytha-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythb-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythc-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythd-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythe-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythf-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythg-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythh-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythi-1.png" style="display: block; margin: auto;" />

---

# Euclidean distance

- In general there are more than two variables.<!--D-->
--

- Is there a way to apply our intuition in two dimensions to higher dimensions?<!--D-->
--

    + Pythagoras' theorem can be *generalised* to higher dimensions.<!--D-->
--

    + This results in a concept of distance called *Euclidean distance*.

---

# Euclidean distance

We measure `\(p\)` variables for two observations: `\(x_{j}\)` is the measurement of variable `\(j\)` for observation `\({\mathbf x}\)`, and `\(y_{j}\)` is the measurement of variable `\(j\)` for observation `\({\mathbf y}\)`. The *Euclidean* distance between `\({\mathbf x}\)` and `\({\mathbf y}\)` is:

`$$D\left({\mathbf x},{\mathbf y}\right)=\sqrt{\sum\limits_{j=1}^p \left(x_{j}-y_{j}\right)^2}$$`

---

# Vectors

- Notice that `\({\mathbf x}\)` and `\({\mathbf y}\)` are examples of **vectors**.
--

- For example `\({\mathbf x}=\begin{pmatrix}x_1\\x_2\end{pmatrix}\)` where `\(x_1\)` is age and `\(x_2\)` is income.
--

- We can think of a data point as
--

    + A vector of attributes or measurements
--

    + A point in space
--

- These are the same thing.
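---

# Euclidean distance in R

- Returning to the simple example, a minimal sketch of the calculation in R (assuming income is recorded in thousands of dollars):


```r
# Age and income ($000s) for the three individuals
x <- rbind(Orange = c(37, 75),
           Red    = c(31, 67),
           Blue   = c(30, 68))
dist(x) # Euclidean distance is the default method
```

- Mr Red and Mr Blue are the closest pair, which matches the scatterplot.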
---

# Other kinds of distance

- We will nearly always use Euclidean distance in this unit; however, there are other ways of understanding distance.
- One example is the *Manhattan distance*, also known as block distance.

`$$D\left({\mathbf x},{\mathbf y}\right)=\sum\limits_{j=1}^p \left|x_{j}-y_{j}\right|$$`

---

# Manhattan Distance

![Manhattan Distance](ManhattanDistance.png)

---

# Distance and Standardising data

- We must be careful about the units of measurement.<!--D-->
--

- Euclidean (and Manhattan) distances change when variables are measured in *different units*.<!--D-->
--

- For this reason, it is common to calculate distance after *standardising* the data.<!--D-->
--

- If the variables are all measured in the same units, then this standardisation is unnecessary.<!--D-->
--

- Some distances are not sensitive to units of measurement (e.g. Mahalanobis distance).

---

# Distance in R

- R has its own special object for distances known as a `dist` object.<!--D-->
--

- It can be obtained using the `dist()` function.<!--D-->
--

- We are going to find Euclidean distances between the beers in the beer dataset. Use:<!--D-->
--

    + Only beers with price greater than $4.50
    + Only numeric variables
    + Standardised data
    + The function `dist` to get the distances

---

# Load packages and data


```r
library(dplyr)
Beer<-readRDS('Beer.rds')
```

---

# Find Distances


```r
Beer%>%filter(price>4.5)%>% #Only expensive Beers
  select_if(is.numeric)%>% #Only numeric variables
  scale%>%
  dist->d
```

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 4.1950 </td> </tr>
<tr> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.6260 </td> </tr>
<tr> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 3.2339 </td> </tr>
<tr> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.3188 </td> </tr>
<tr> <td style="text-align:right;"> 4.1950 </td> <td style="text-align:right;"> 1.6260 </td> <td style="text-align:right;"> 3.2339 </td> <td style="text-align:right;"> 3.3188 </td> <td style="text-align:right;"> 0.0000 </td> </tr>
</tbody>
</table>

---

# Labels

- Only numeric variables were used to compute distances.
- The names of the beers are not attached to the `dist` object.
- This can be achieved by assigning the beer names to `attributes(d)$Labels`, as shown on the next slides.
- Here `d` is the `dist` object.
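---

# Inspecting a dist object

- A `dist` object prints only the lower triangle of distances. One way to see the full symmetric matrix (a table like the one on the previous slide could be built from it, although the exact styling may differ) is to convert it with `as.matrix()`. A minimal sketch:


```r
# Convert the dist object to a full symmetric matrix for inspection
round(as.matrix(d), 4)
```

- Once the beer names are attached (next slide), the rows and columns of this matrix are labelled with those names.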
---

# Use Beer Names


```r
Beer%>%filter(price>4.5)%>% #Only expensive Beers
  pull(beer)-> #Get beer names
  attributes(d)$Labels #"Attach" them to dist object
```

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Anchor Steam </th> <th style="text-align:right;"> Becks </th> <th style="text-align:right;"> Heineken </th> <th style="text-align:right;"> Kirin </th> <th style="text-align:right;"> St Pauli Girl </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Anchor Steam </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 4.1950 </td> </tr>
<tr> <td style="text-align:left;"> Becks </td> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.6260 </td> </tr>
<tr> <td style="text-align:left;"> Heineken </td> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 3.2339 </td> </tr>
<tr> <td style="text-align:left;"> Kirin </td> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.3188 </td> </tr>
<tr> <td style="text-align:left;"> St Pauli Girl </td> <td style="text-align:right;"> 4.1950 </td> <td style="text-align:right;"> 1.6260 </td> <td style="text-align:right;"> 3.2339 </td> <td style="text-align:right;"> 3.3188 </td> <td style="text-align:right;"> 0.0000 </td> </tr>
</tbody>
</table>

---

# Your Turn

- Compute the distance without standardising the data.<!--D-->
--

- Compute the Manhattan distance for standardised data.<!--D-->
--

- Compute the Manhattan distance for unstandardised data.

---
class: inverse, middle, center

# Non-Metric

---

# Non-metric Data

- Can we define distance when the variables are non-metric?<!--D-->
--

- The answer is yes!<!--D-->
--

- We will discuss two approaches:<!--D-->
--

    + Jaccard Similarity/Distance
    + Dummy Variables

---

# First a motivation

- Many people use music streaming services like Spotify.<!--D-->
--

- One of the attractions of these services is that they recommend artists based on the favourite artists of other users who have similar taste in music.<!--D-->
--

- The data in this case is in the form of a list of favourite artists.

---

# Distance in musical taste

- Suppose there are three customers with the following favourite artists:<!--D-->
--

    + Customer A: Post Malone, Drake, Lil Peep, Billie Eilish
    + Customer B: Post Malone, Lil Peep, Juice Wrld
    + Customer C: Billie Eilish, Ed Sheeran, Ariana Grande<!--D-->
--

- How do we measure which customers have similar taste and which have different taste?

---

# Jaccard Similarity and Distance

- Jaccard similarity gives us a measure of how close two *sets* are, in this case the set of each customer's favourite musicians. The formula is

`$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$`

- where `\(|A\cap B|\)` is the number of elements in both set A and set B, and `\(|A\cup B|\)` is the number of elements in either set A or set B.
---

# Jaccard Similarity

- In our example<!--D-->
--

    + `\(A\cap B = \left\{\mbox{Post Malone, Lil Peep}\right\}\)`
    + `\(|A\cap B|=2\)`
    + `\(\begin{align}A\cup B =& \{\mbox{Post Malone, Lil Peep,}\\& \mbox{Drake, Billie Eilish, Juice Wrld}\}\end{align}\)`
    + `\(|A\cup B|=5\)`<!--D-->
--

- The Jaccard similarity will be `\(J=2/5=0.4\)`. The Jaccard *distance* is `\(d_J=1-J=1-0.4=0.6\)`.

---

# Using dummy variables

- Alternatively, the same data can be coded using dummy variables:<!--D-->
--

    + `\(X_{j}=1\)` if artist `\(j\)` is a favourite of customer `\(x\)`
    + `\(X_{j}=0\)` otherwise<!--D-->
--

- The usual distance measures such as Euclidean or Manhattan distance can then be used.

---

# Collaborative Filtering

<img src="CollaborativeFiltering.jpeg" width="1333" style="display: block; margin: auto;" />

Figure by Mohamed Ben Ellefi

---

# Recommender Systems

- Famous recommender systems are used by Amazon, Netflix and Alibaba, amongst others.
- These systems are usually a hybrid of
    - Collaborative Filtering
    - Content-based Filtering
- The method we discussed is more specifically called memory-based collaborative filtering.

---

# Axioms of Distance

- Care should be taken when using the word distance.
- In formal mathematics a distance is a function with two inputs that has to satisfy four properties.
- These four properties are called *axioms*.
- The distance measures that we have discussed satisfy the axioms.

---

# Axioms of Distance

1. Non-negative: `\(d(x,y)\ge 0\)`
    - There cannot be negative distance.
2. Symmetry: `\(d(x,y)=d(y,x)\)`
    - It cannot be a different distance from Melbourne to Brisbane than from Brisbane to Melbourne.

---

# Axioms of Distance

3. Identity of indiscernibles: if `\(d(x,y)=0\)` then `\(x=y\)` and vice versa
    - If the distance from Melbourne to some place is zero, then that place is Melbourne. Similarly, the distance from Melbourne to itself is zero.
4. Triangle inequality: `\(d(x,z)\leq d(x,y)+d(y,z)\)`
    - It cannot be closer to go from Melbourne to Brisbane via Sydney than it is to go from Melbourne to Brisbane directly.

---

# Conclusions

- That concludes the topic on distance.<!--D-->
--

- This is relevant to the following topics:<!--D-->
--

    + Cluster Analysis<!--D-->
--

    + Multidimensional Scaling (MDS)<!--D-->
--

- Now an exercise.

---

# Distances between tweets

- Find someone on Twitter or a similar social media site.
    + Find their first two tweets.
    + Think of a way to compute a Jaccard distance between their tweets.
- Hint: Think of the words used in the tweet as a *set*.
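---

# Jaccard distance in R

- A minimal sketch for the exercise, using the music example from earlier. The helper `jaccard_distance` is purely illustrative, and the treatment of a tweet as a set of words is one possible choice:


```r
# Jaccard distance between two sets, coded as character vectors
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

A <- c("Post Malone", "Drake", "Lil Peep", "Billie Eilish")
B <- c("Post Malone", "Lil Peep", "Juice Wrld")
jaccard_distance(A, B) # 1 - 2/5 = 0.6, as on the earlier slide

# For tweets, the sets could be the distinct words in each tweet, e.g.
# unique(strsplit(tolower(tweet), "\\s+")[[1]])
```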