class: center, middle, inverse, title-slide

# Distance
## High Dimensional Data Analysis
### Anastasios Panagiotelis & Ruben Loaiza-Maya
### Lecture 3

---
class: inverse, center, middle

# Why distance?

---

# Why distance?

- Many problems involve thinking about how *similar* or dissimilar two observations are. For example:<!--D-->
--

    + May use the same marketing strategy for *similar* demographic groups.
    + May lend money to applicants who are *similar* to those who pay their debts back.<!--D-->
--

- Arguably the most important concept in data analysis is *distance*.

---

# Simple example

- Consider 3 individuals:<!--D-->
--

    + Mr Orange: 37 years of age, earns $75k a year
    + Mr Red: 31 years of age, earns $67k a year
    + Mr Blue: 30 years of age, earns $68k a year<!--D-->
--

- Which two are the most similar?

---

# On a scatterplot

<img src="Distance_files/figure-html/scatter-1.png" style="display: block; margin: auto;" />

---

# Distance as a number

- It is easy to think about three individuals, but what if there are thousands of individuals?<!--D-->
--

    + In this case it will be useful to attach some number to the distance between pairs of individuals.<!--D-->
--

    + We will do it with a simple application of Pythagoras' theorem.

---

# Finding the Distance

<img src="Distance_files/figure-html/pytha-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythb-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythc-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythd-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythe-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythf-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythg-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythh-1.png" style="display: block; margin: auto;" />

---

# Finding the Distance

<img src="Distance_files/figure-html/pythi-1.png" style="display: block; margin: auto;" />

---

# Euclidean distance

- In general there are more than two variables.<!--D-->
--

- Is there a way to apply our intuition in two dimensions to higher dimensions?<!--D-->
--

    + Pythagoras' theorem can be *generalised* to higher dimensions.<!--D-->
--

    + This results in a concept of distance called *Euclidean distance*.

---

# Euclidean distance

We measure `\(p\)` variables for two observations: `\(x_{j}\)` is the measurement of variable `\(j\)` for observation `\({\mathbf x}\)`, and `\(y_{j}\)` is the measurement of variable `\(j\)` for observation `\({\mathbf y}\)`. The *Euclidean* distance between `\({\mathbf x}\)` and `\({\mathbf y}\)` is:

`$$D\left({\mathbf x},{\mathbf y}\right)=\sqrt{\sum\limits_{j=1}^p \left(x_{j}-y_{j}\right)^2}$$`

---

# Vectors

- Notice that `\({\mathbf x}\)` and `\({\mathbf y}\)` are examples of **vectors**.
--

- For example `\({\mathbf x}=\begin{pmatrix}x_1\\x_2\end{pmatrix}\)` where `\(x_1\)` is age and `\(x_2\)` is income.
--

- We can think of a data point as
--

    + A vector of attributes or measurements
--

    + A point in space
--

- These are the same thing.
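---

# Euclidean distance in R

- Returning to the simple example, a minimal sketch of the calculation in R (assuming income is recorded in thousands of dollars):


```r
# Age and income ($000s) for the three individuals
x <- rbind(Orange = c(37, 75),
           Red    = c(31, 67),
           Blue   = c(30, 68))
dist(x) # Euclidean distance is the default method
```

- Mr Red and Mr Blue are the closest pair, which matches the scatterplot.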
---

# Other kinds of distance

- We will nearly always use Euclidean distance in this unit; however, there are other ways of understanding distance.
- One example is the *Manhattan distance*, also known as block distance.

`$$D\left({\mathbf x},{\mathbf y}\right)=\sum\limits_{j=1}^p \left|x_{j}-y_{j}\right|$$`

---

# Manhattan Distance

![Manhattan Distance](ManhattanDistance.png)

---

# Distance and Standardising data

- We must be careful about the units of measurement.<!--D-->
--

- Euclidean (and Manhattan) distances change when variables are measured in *different units*.<!--D-->
--

- For this reason, it is common to calculate distance after *standardising* the data.<!--D-->
--

- If the variables are all measured in the same units, then this standardisation is unnecessary.<!--D-->
--

- Some distances are not sensitive to units of measurement (e.g. Mahalanobis distance).

---

# Distance in R

- R has its own special object for distances known as a `dist` object.<!--D-->
--

- It can be obtained using the `dist()` function.<!--D-->
--

- We are going to find Euclidean distances between the beers in the beer dataset. Use:<!--D-->
--

    + Only beers with price greater than $4.50
    + Only numeric variables
    + Standardised data
    + The function `dist` to get the distances

---

# Load packages and data


```r
library(dplyr)
Beer<-readRDS('Beer.rds')
```

---

# Find Distances


```r
Beer%>%filter(price>4.5)%>% #Only expensive Beers
  select_if(is.numeric)%>% #Only numeric variables
  scale%>%
  dist->d
```

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 4.1950 </td> </tr>
<tr> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.6260 </td> </tr>
<tr> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 3.2339 </td> </tr>
<tr> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.3188 </td> </tr>
<tr> <td style="text-align:right;"> 4.1950 </td> <td style="text-align:right;"> 1.6260 </td> <td style="text-align:right;"> 3.2339 </td> <td style="text-align:right;"> 3.3188 </td> <td style="text-align:right;"> 0.0000 </td> </tr>
</tbody>
</table>

---

# Labels

- Only numeric variables were used to compute distances.
- The names of the beers are not attached to the `dist` object.
- This can be achieved by assigning the beer names to `attributes(d)$Labels`, as shown on the next slides.
- Here `d` is the `dist` object.
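---

# Inspecting a dist object

- A `dist` object prints only the lower triangle of distances. One way to see the full symmetric matrix (a table like the one on the previous slide could be built from it, although the exact styling may differ) is to convert it with `as.matrix()`. A minimal sketch:


```r
# Convert the dist object to a full symmetric matrix for inspection
round(as.matrix(d), 4)
```

- Once the beer names are attached (next slide), the rows and columns of this matrix are labelled with those names.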
---

# Use Beer Names


```r
Beer%>%filter(price>4.5)%>% #Only expensive Beers
  pull(beer)-> #Get beer names
  attributes(d)$Labels #"Attach" them to dist object
```

<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Anchor Steam </th> <th style="text-align:right;"> Becks </th> <th style="text-align:right;"> Heineken </th> <th style="text-align:right;"> Kirin </th> <th style="text-align:right;"> St Pauli Girl </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;"> Anchor Steam </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 4.1950 </td> </tr>
<tr> <td style="text-align:left;"> Becks </td> <td style="text-align:right;"> 3.4298 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.6260 </td> </tr>
<tr> <td style="text-align:left;"> Heineken </td> <td style="text-align:right;"> 3.8333 </td> <td style="text-align:right;"> 2.3009 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 3.2339 </td> </tr>
<tr> <td style="text-align:left;"> Kirin </td> <td style="text-align:right;"> 4.1632 </td> <td style="text-align:right;"> 2.8076 </td> <td style="text-align:right;"> 1.1482 </td> <td style="text-align:right;"> 0.0000 </td> <td style="text-align:right;"> 3.3188 </td> </tr>
<tr> <td style="text-align:left;"> St Pauli Girl </td> <td style="text-align:right;"> 4.1950 </td> <td style="text-align:right;"> 1.6260 </td> <td style="text-align:right;"> 3.2339 </td> <td style="text-align:right;"> 3.3188 </td> <td style="text-align:right;"> 0.0000 </td> </tr>
</tbody>
</table>

---

# Your Turn

- Compute the distance without standardising the data.<!--D-->
--

- Compute the Manhattan distance for standardised data.<!--D-->
--

- Compute the Manhattan distance for unstandardised data.

---
class: inverse, middle, center

# Non-Metric

---

# Non-metric Data

- Can we define distance when the variables are non-metric?<!--D-->
--

- The answer is yes!<!--D-->
--

- We will discuss two approaches:<!--D-->
--

    + Jaccard Similarity/Distance
    + Dummy Variables

---

# First a motivation

- Many people use music streaming services like Spotify.<!--D-->
--

- One of the attractions of these services is that they recommend artists based on the favourite artists of other users who have similar taste in music.<!--D-->
--

- The data in this case is in the form of a list of favourite artists.

---

# Distance in musical taste

- Suppose there are three customers with the following favourite artists:<!--D-->
--

    + Customer A: Post Malone, Drake, Lil Peep, Billie Eilish
    + Customer B: Post Malone, Lil Peep, Juice Wrld
    + Customer C: Billie Eilish, Ed Sheeran, Ariana Grande<!--D-->
--

- How do we measure which customers have similar taste and which have different taste?

---

# Jaccard Similarity and Distance

- Jaccard similarity gives us a measure of how close two *sets* are, in this case the set of each customer's favourite musicians. The formula is

`$$J(A,B)=\frac{|A\cap B|}{|A\cup B|}$$`

- where `\(|A\cap B|\)` is the number of elements in both set A and set B, and `\(|A\cup B|\)` is the number of elements in either set A or set B.
---

# Jaccard Similarity

- In our example<!--D-->
--

    + `\(A\cap B = \left\{\mbox{Post Malone, Lil Peep}\right\}\)`
    + `\(|A\cap B|=2\)`
    + `\(\begin{align}A\cup B =& \{\mbox{Post Malone, Lil Peep,}\\& \mbox{Drake, Billie Eilish, Juice Wrld}\}\end{align}\)`
    + `\(|A\cup B|=5\)`<!--D-->
--

- The Jaccard similarity will be `\(J=2/5=0.4\)`. The Jaccard *distance* is `\(d_J=1-J=1-0.4=0.6\)`.

---

# Using dummy variables

- Alternatively, the same data can be coded using dummy variables:<!--D-->
--

    + `\(X_{j}=1\)` if artist `\(j\)` is a favourite of customer `\(x\)`
    + `\(X_{j}=0\)` otherwise<!--D-->
--

- The usual distance measures such as Euclidean or Manhattan distance can then be used.

---

# Collaborative Filtering

<img src="CollaborativeFiltering.jpeg" width="1333" style="display: block; margin: auto;" />

Figure by Mohamed Ben Ellefi

---

# Recommender Systems

- Famous recommender systems are used by Amazon, Netflix and Alibaba, amongst others.
- These systems are usually a hybrid of
    - Collaborative Filtering
    - Content-based Filtering
- The method we discussed is more specifically called memory-based collaborative filtering.

---

# Axioms of Distance

- Care should be taken when using the word distance.
- In formal mathematics a distance is a function with two inputs that has to satisfy four properties.
- These four properties are called *axioms*.
- The distance measures that we have discussed satisfy the axioms.

---

# Axioms of Distance

1. Non-negative: `\(d(x,y)\ge 0\)`
    - There cannot be negative distance.
2. Symmetry: `\(d(x,y)=d(y,x)\)`
    - It cannot be a different distance from Melbourne to Brisbane than from Brisbane to Melbourne.

---

# Axioms of Distance

3. Identity of indiscernibles: if `\(d(x,y)=0\)` then `\(x=y\)` and vice versa
    - If the distance from Melbourne to some place is zero, then that place is Melbourne. Similarly, the distance from Melbourne to itself is zero.
4. Triangle inequality: `\(d(x,z)\leq d(x,y)+d(y,z)\)`
    - It cannot be closer to go from Melbourne to Brisbane via Sydney than it is to go from Melbourne to Brisbane directly.

---

# Conclusions

- That concludes the topic on distance.<!--D-->
--

- This is relevant to the following topics:<!--D-->
--

    + Cluster Analysis<!--D-->
--

    + Multidimensional Scaling (MDS)<!--D-->
--

- Now an exercise.

---

# Distances between tweets

- Find someone on Twitter or a similar social media site.
    + Find their first two tweets.
    + Think of a way to compute a Jaccard distance between their tweets.
- Hint: Think of the words used in the tweet as a *set*.
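---

# Jaccard distance in R

- A minimal sketch for the exercise, using the music example from earlier. The helper `jaccard_distance` is purely illustrative, and the treatment of a tweet as a set of words is one possible choice:


```r
# Jaccard distance between two sets, coded as character vectors
jaccard_distance <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

A <- c("Post Malone", "Drake", "Lil Peep", "Billie Eilish")
B <- c("Post Malone", "Lil Peep", "Juice Wrld")
jaccard_distance(A, B) # 1 - 2/5 = 0.6, as on the earlier slide

# For tweets, the sets could be the distinct words in each tweet, e.g.
# unique(strsplit(tolower(tweet), "\\s+")[[1]])
```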