class: center, middle, inverse, title-slide

# Predictive Analytics
## Data Visualisation and Analytics
### Anastasios Panagiotelis and Lauren Kennedy
### Lecture 8

---
class: inverse, center, middle

# Time for Analytics

---

# Minor adjustments

- So far our attention has been entirely on *visualisation*.

--

- For the remainder of the unit we will focus on *analytics*.

--

- In particular, our focus will be on making *predictions*.

---

# Textbook

- Much of the content of these slides is covered in [Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) by James, Witten, Hastie, and Tibshirani.

--

- In particular Chapters 2, 4 and 8.

--

- The textbook is not mandatory but you may find it useful.

--

- It is available for free.

---

# Prediction

- Prediction arises in many business contexts.

--

- There is some unknown variable that is the target of the prediction.
  - This is usually denoted `\(y\)` and may be called the *dependent variable*, *response*, or *target variable*.

--

- There are some known variables that are used to make the prediction.
  - These are usually denoted `\(\mathbf{x}\)` and may be called the *independent variables*, *predictors*, or *features*.

---

# Supervised Learning

- For some observations data will be available for both `\(y\)` AND `\(\mathbf{x}\)`.

--

- We can use these observations to *learn* some rule that gives predictions of `\(y\)` as a function of `\(\mathbf{x}\)`.

--

- This prediction is denoted `\(\hat{y}=\hat{f}(\mathbf{x})\)`.

--

- This general setup is often called *supervised learning*.

---

# Summary

|Variable               |Training       |Evaluation         |
|-----------------------|---------------|-------------------|
|Predictor 1 (X1)       |Data available |Data available     |
|Predictor 2 (X2)       |Data available |Data available     |
|Dependent Variable (Y) |Data available |Data NOT available |

---

# Example

|Variable               |Old Customers  |New Customer       |
|-----------------------|---------------|-------------------|
|Age (X1)               |Data available |Data available     |
|Limit (X2)             |Data available |Data available     |
|Default (Y)            |Data available |Data NOT available |

---

# Regression

- Sometimes `\(y\)` is a numeric (metric) variable. For example:
  - Company profit next month.
  - Amount spent by a customer.
  - Demand for a new product.

--

- In this case we are doing *regression*.

--

- This can be more general than the linear regression that you may be familiar with.

---

# Classification

- Sometimes `\(y\)` is a categorical (nominal, non-metric) variable. For example:
  - Will a borrower default on a loan?
  - Can we detect which tax returns are fraudulent?
  - Can we predict which brand customers will choose?

--

- In this case we are doing *classification*.

---

# Credit Data

<img src="PredictiveAnalytics_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Default or not?

<img src="PredictiveAnalytics_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Default or not?

<img src="PredictiveAnalytics_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Assessing Classification

---

# Some math

- Generally data `\(y_i\)` and `\(\mathbf{x}_i\)` are available for `\(i=1,2,3,\ldots,n\)`.

--

- An algorithm is trained on this data, producing some function of `\(\mathbf{x}\)` so that `\(\hat{y}=\hat{f}(\mathbf{x})\)`.

--

- How do we decide whether `\(\hat{f}\)` is a good or a bad classifier? (A small code sketch of this setup follows on the next slide.)
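
---

# The setup in code

A minimal sketch of `\(\hat{y}=\hat{f}(\mathbf{x})\)`, assuming the `Default` data that accompanies the ISLR textbook (an assumption; it may not be the data behind the earlier plots). Logistic regression is only a placeholder for the classifiers introduced later.

```r
# Load the Default data: y is default (Yes/No), x includes the card balance
library(ISLR)

# Train a simple classifier on the observed (y, x) pairs
fit <- glm(default ~ balance, data = Default, family = binomial)

# y-hat = f-hat(x): predicted probabilities converted to a predicted class
p_hat <- predict(fit, type = "response")
y_hat <- ifelse(p_hat > 0.5, "Yes", "No")

table(y_hat, Default$default)   # compare predictions with the observed y
```
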
---

# Misclassification

- The **misclassification error** is given by
$$\frac{1}{n}\sum\limits_{i=1}^n I(y_i\neq\hat{y}_i)$$
- Here `\(I(.)\)` equals 1 if the statement in parentheses is true and 0 otherwise.

--

- Larger values imply worse performance.

--

- Since all `\(n\)` points are used for both training and evaluation, this measures *in-sample* performance.

---

# Training v Test

- In practice we want predictions for values of `\(y\)` that are not yet observed.

--

- To artificially create this scenario, the available data can be split into two:
  - A *training* sample used to determine `\(\hat{f}\)`.
  - A *test* sample used to evaluate `\(\hat{f}\)`.

--

- The `\(y\)` values of the test sample are treated as unknown during training.

---

# Notation

- `\(\mathcal{N}_{1}\)` is the set of indices for the training data.
- `\(|\mathcal{N}_{1}|\)` is the number of observations in the training data.
- `\(\mathcal{N}_{0}\)` is the set of indices for the test data.
- `\(|\mathcal{N}_{0}|\)` is the number of observations in the test data.

---

# Example

- Suppose there are five observations, `\((y_1,\mathbf{x}_1),(y_2,\mathbf{x}_2),\ldots,(y_5,\mathbf{x}_5)\)`.
- Suppose observations 1, 2 and 4 are used as training data.
- Suppose observations 3 and 5 are used as test data.
- Then `\(\mathcal{N_1}=\left\{1,2,4\right\}\)` and `\(|\mathcal{N_1}|=3\)`.
- And `\(\mathcal{N_0}=\left\{3,5\right\}\)` and `\(|\mathcal{N_0}|=2\)`.
- Only the data in `\(\mathcal{N_1}\)` are used to determine `\(\hat{f}\)`.

---

# Training v Test

The *training* error rate is

`$$\frac{1}{|\mathcal{N}_{1}|}\sum\limits_{i\in\mathcal{N}_{1}} I(y_i\neq\hat{y}_i)$$`

The *test* error rate is

`$$\frac{1}{|\mathcal{N}_{0}|}\sum\limits_{i\in\mathcal{N}_{0}} I(y_i\neq\hat{y}_i)$$`

---

# Overfitting

- Some methods perform very well (even perfectly) on the training error rate.

--

- Usually these same methods perform poorly on the test error rate.

--

- This phenomenon is called *overfitting*.

--

- Generally, achieving a low **test** error rate (also called out-of-sample or generalisation error) is more important.

---

# A simple example

- Consider a test set of a single observation, `\(\mathcal{N_0}=\left\{j \right\}\)`.

--

- The classifier is trained using all data apart from observation `\(j\)`.

--

- This classifier is then used to predict the value of `\(y_j\)`.

--

- The choice of `\(j\)` may seem arbitrary.

---

# Cross validation

- The process can be repeated so that each observation is left out exactly once.

--

- Each time, all remaining observations are used as the training set.

--

- This process is called **leave-one-out cross validation (LOOCV)**.

---

# k-fold CV

- A faster alternative to LOOCV is **k-fold cross validation**.

--

- The data are randomly split into `\(k\)` partitions.

--

- Each observation appears in exactly one partition, i.e. the partitions are *non-overlapping*.

--

- Each partition is used as the test set exactly once.

---

# For regression

- In regression, rather than the misclassification rate, it may be better to look at the sum of squared errors.

--

- The concepts of training and test sets apply in the same way.

--

- Leave-one-out and k-fold cross validation can also be used in the same way.

---

# Next step

- Eventually we will introduce specific algorithms for doing classification.

--

- However, for all of these algorithms the distinction between training and test data is important.

--

- Equally, cross validation will be used throughout.

--

- Make sure you understand how these ideas work, separately from specific algorithms (the next slide sketches a train/test split in code).
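
---

# Train/test split in code

A minimal sketch of a training/test split and the resulting error rates, again assuming the `Default` data from the `ISLR` package; `glm()` simply stands in for whichever classifier is being assessed.

```r
library(ISLR)

set.seed(1)
n <- nrow(Default)

# N1: indices of the training sample (here 70% of the data); the rest form N0
train <- sample(n, size = round(0.7 * n))

# Fit f-hat using the training data only
fit <- glm(default ~ balance, data = Default[train, ], family = binomial)

# Training error rate
p_train <- predict(fit, type = "response")
mean(ifelse(p_train > 0.5, "Yes", "No") != Default$default[train])

# Test error rate: predictions for observations the classifier has not seen
p_test <- predict(fit, newdata = Default[-train, ], type = "response")
mean(ifelse(p_test > 0.5, "Yes", "No") != Default$default[-train])
```
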
---
class: inverse, middle, center

# Additional issues with prediction

---

# Predict v Explain

- In this unit our emphasis will be on *prediction*.

--

- This is very different to *explanation* or *causality*.

--

- Consider the example of predicting sales of Toyotas by looking at the number of internet searches for "Toyota".

--

- If a large number of people are searching for "Toyota", sales of Toyotas in the following period are likely to be higher.

---

# Causality

- This relationship is not easy to manipulate.

--

- For instance, if Toyota instructs its employees to spend the afternoon searching the word "Toyota" on Google, sales will not go up.

--

- In this case there is a *common cause* for browsing for cars and buying cars; namely, the intent to buy a car.

--

- Unlike intent to buy a car, browsing behaviour is observable and can be used for prediction.

---

# Two class v Multiclass

- Many classification problems involve a `\(y\)` variable that can take two values.
  - Default on credit card v Not Default.

--

- In other cases the `\(y\)` variable can take multiple values.
  - Brand choice, e.g. Gucci v Louis Vuitton v YSL v Givenchy.

--

- The methods we cover are general enough for both settings.

---

# Probabilistic Classification

- In many cases an algorithm will predict a single "best" class.
  - Predict that a customer will purchase Gucci.

--

- In other instances an algorithm will provide probabilities.
  - The customer has a 40% chance of purchasing Gucci, a 35% chance of purchasing Givenchy and a 25% chance of purchasing YSL.

---

# Probabilistic Classification

- A probabilistic prediction can be converted to a point prediction.

--

- Simply choose the class with the highest probability.

--

- In the example on the previous slide the choice would be Gucci.

---

# Two class case

- In the two class case, choosing the class with the highest probability is simple.

--

- Assign to a class if its probability is greater than 0.5.

--

- In some applications a different threshold may be used.

--

- This is particularly the case if there are asymmetric costs involved with different types of misclassification.

---

# An example

- Suppose you work for the tax office.

--

- You need to decide who should be audited and who should not be audited.

--

- When doing classification you can make two mistakes:
  - Audit an innocent person.
  - Fail to audit a guilty person.
- Are these mistakes equally costly?

---

# Tax example

- Auditing an innocent person is costly since resources are used for no gain.
  - Suppose it costs $100 to audit a person.
- Failing to audit a guilty person is costly since there is a failure to recover tax revenue.
  - Let $500 be the amount recovered from the guilty.
- In this example, it is more costly to fail to audit the guilty.
- However, the misclassification rate treats both errors the same.

---

# Sensitivity v Specificity

- In a 2-class problem, think of one class as the presence of a condition and the other class as the absence of a condition.

--

- In the auditing example the condition can be that the person is guilty.

--

- *Sensitivity* refers to the true positive rate: the proportion of the guilty classified as guilty.

--

- *Specificity* refers to the true negative rate: the proportion of the innocent classified as innocent.

---

# Sensitivity v Specificity

- Consider that we audit when the probability of being guilty is greater than 50%.

--

- Changing this threshold changes the sensitivity and specificity.

--

- Reducing the threshold to 0 means everyone is audited.
  The sensitivity will be perfect but the specificity will be zero.

--

- Raising the threshold to 1 means no one is audited.
  The specificity will be perfect but the sensitivity will be zero.

---

# Example

|Person  |Pred. Pr. Guilty |Truth      |
|--------|-----------------|-----------|
|A       |0.3              |Not Guilty |
|B       |0.4              |Guilty     |
|C       |0.6              |Guilty     |
|D       |0.7              |Guilty     |

---

# Questions

- For a threshold of 0.5:
  - What is your prediction for each individual?
  - What is the misclassification error?
  - What is the sensitivity?
  - What is the specificity?
  - What is the cost?

---

# Answer

|Person  |Pred. Pr. Guilty |Prediction |Truth      |
|--------|-----------------|-----------|-----------|
|A       |0.3              |Not Guilty |Not Guilty |
|B       |0.4              |Not Guilty |Guilty     |
|C       |0.6              |Guilty     |Guilty     |
|D       |0.7              |Guilty     |Guilty     |

---

# Answers

- Misclassification error is 0.25.
- Sensitivity is 0.6667.
- Specificity is 1.
- Cost is $500.

---

# Your turn

- How do the answers change when the threshold is 0.2?
- How do the answers change when the threshold is 0.65?
- (A code sketch at the end of these slides checks all of these numbers.)

---

# Answer (Threshold 0.2)

|Person  |Pred. Pr. Guilty |Prediction |Truth      |
|--------|-----------------|-----------|-----------|
|A       |0.3              |Guilty     |Not Guilty |
|B       |0.4              |Guilty     |Guilty     |
|C       |0.6              |Guilty     |Guilty     |
|D       |0.7              |Guilty     |Guilty     |

---

# Answers

- Misclassification error is 0.25.
- Sensitivity is 1.
- Specificity is 0.
- Cost is $100.

---

# Answer (Threshold 0.65)

|Person  |Pred. Pr. Guilty |Prediction |Truth      |
|--------|-----------------|-----------|-----------|
|A       |0.3              |Not Guilty |Not Guilty |
|B       |0.4              |Not Guilty |Guilty     |
|C       |0.6              |Not Guilty |Guilty     |
|D       |0.7              |Guilty     |Guilty     |

---

# Answers

- Misclassification error is 0.5.
- Sensitivity is 0.333.
- Specificity is 1.
- Cost is $1000.

---

# Conclusion

- For the remainder of the unit the focus is on different ways to do classification.

--

- In a business setting (and any other setting) be aware that:
  - Correlation does not imply causation.
  - Prediction should be thought about probabilistically.
  - Cost should be taken into account when classification is used in decision making.
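
---

# The audit example in code

A minimal sketch that checks the numbers above by varying the threshold and recomputing the error rate, sensitivity, specificity and cost. The probabilities, truths and costs are taken from the slides; the helper function `assess` is just for illustration.

```r
# Predicted probabilities of guilt and the true status of the four people
p_guilty <- c(A = 0.3, B = 0.4, C = 0.6, D = 0.7)
truth    <- c(A = "Not Guilty", B = "Guilty", C = "Guilty", D = "Guilty")

assess <- function(threshold) {
  # Audit (classify as guilty) when the predicted probability exceeds the threshold
  pred <- ifelse(p_guilty > threshold, "Guilty", "Not Guilty")
  c(error       = mean(pred != truth),
    sensitivity = mean(pred[truth == "Guilty"] == "Guilty"),
    specificity = mean(pred[truth == "Not Guilty"] == "Not Guilty"),
    # $100 per audited innocent person, $500 lost per unaudited guilty person
    cost        = 100 * sum(pred == "Guilty" & truth == "Not Guilty") +
                  500 * sum(pred == "Not Guilty" & truth == "Guilty"))
}

sapply(c(0.2, 0.5, 0.65), assess)   # one column of results per threshold
```
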