class: center, middle, inverse, title-slide

# Predictive Analytics
## Data Visualisation and Analytics
### Anastasios Panagiotelis and Lauren Kennedy
### Lecture 8

---
class: inverse, center, middle

# Time for Analytics

---

# Minor adjustments

- So far our attention has been entirely on *visualisation*.

--

- For the remainder of the unit we will focus on *analytics*.

--

- In particular, our focus will be on making *predictions*.

---

# Textbook

- Much of the content of these slides is covered in [Introduction to Statistical Learning](http://faculty.marshall.usc.edu/gareth-james/ISL/) by James, Witten, Hastie, and Tibshirani.

--

- In particular Chapters 2, 4 and 8.

--

- The textbook is not mandatory but you may find it useful.

--

- It is available for free.

---

# Prediction

- Prediction arises in many business contexts.

--

- There is some unknown variable that is the target of the prediction.
  - This is usually denoted `\(y\)` and may be called the *dependent variable*, *response*, or *target variable*.

--

- There are some known variables that are used to make the prediction.
  - These are usually denoted `\(\mathbf{x}\)` and may be called the *independent variables*, *predictors*, or *features*.

---

# Supervised Learning

- For some observations data will be available for both `\(y\)` AND `\(\mathbf{x}\)`.

--

- We can use these observations to *learn* some rule that gives predictions of `\(y\)` as a function of `\(\mathbf{x}\)`.

--

- This prediction is denoted `\(\hat{y}=\hat{f}(\mathbf{x})\)`.

--

- This general setup is often called *supervised learning*.

---

# Summary

|Variable               |Training       |Evaluation         |
|-----------------------|---------------|-------------------|
|Predictor 1 (X1)       |Data available |Data available     |
|Predictor 2 (X2)       |Data available |Data available     |
|Dependent Variable (Y) |Data available |Data NOT available |

---

# Example

|Variable               |Old Customers  |New Customer       |
|-----------------------|---------------|-------------------|
|Age (X1)               |Data available |Data available     |
|Limit (X2)             |Data available |Data available     |
|Default (Y)            |Data available |Data NOT available |

---

# Regression

- Sometimes `\(y\)` is a numeric (metric) variable. For example:
  - Company profit next month.
  - Amount spent by a customer.
  - Demand for a new product.

--

- In this case we are doing *regression*.

--

- This can be more general than the linear regression that you may be familiar with.

---

# Classification

- Sometimes `\(y\)` is a categorical (nominal, non-metric) variable. For example:
  - Will a borrower default on a loan?
  - Can we detect which tax returns are fraudulent?
  - Can we predict which brand customers will choose?

--

- In this case we are doing *classification*.

---

# Credit Data

<img src="PredictiveAnalytics_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Default or not?

<img src="PredictiveAnalytics_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Default or not?

<img src="PredictiveAnalytics_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
class: inverse, middle, center

# Assessing Classification

---

# Some math

- Generally data `\(y_i\)` and `\(\mathbf{x}_i\)` are available for `\(i=1,2,3,\ldots,n\)`.

--

- An algorithm is trained on this data, producing some function of `\(\mathbf{x}\)` so that `\(\hat{y}=\hat{f}(\mathbf{x})\)`.

--

- How do we decide whether `\(\hat{f}\)` is a good or a bad classifier? (A small code sketch of this setup follows on the next slide.)
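
---

# The setup in code

A minimal sketch of `\(\hat{y}=\hat{f}(\mathbf{x})\)`, assuming the `Default` data that accompanies the ISLR textbook (an assumption; it may not be the data behind the earlier plots). Logistic regression is only a placeholder for the classifiers introduced later.

```r
# Load the Default data: y is default (Yes/No), x includes the card balance
library(ISLR)

# Train a simple classifier on the observed (y, x) pairs
fit <- glm(default ~ balance, data = Default, family = binomial)

# y-hat = f-hat(x): predicted probabilities converted to a predicted class
p_hat <- predict(fit, type = "response")
y_hat <- ifelse(p_hat > 0.5, "Yes", "No")

table(y_hat, Default$default)   # compare predictions with the observed y
```
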
---

# Misclassification

- The **misclassification error** is given by
$$\frac{1}{n}\sum\limits_{i=1}^n I(y_i\neq\hat{y}_i)$$
- Here `\(I(.)\)` equals 1 if the statement in parentheses is true and 0 otherwise.

--

- Larger values imply worse performance.

--

- Since all `\(n\)` points are used for both training and evaluation, this measures *in-sample* performance.

---

# Training v Test

- In practice we want predictions for values of `\(y\)` that are not yet observed.

--

- To artificially create this scenario, the available data can be split into two:
  - A *training* sample used to determine `\(\hat{f}\)`.
  - A *test* sample used to evaluate `\(\hat{f}\)`.

--

- The `\(y\)` values of the test sample are treated as unknown during training.

---

# Notation

- `\(\mathcal{N}_{1}\)` is the set of indices for the training data.
- `\(|\mathcal{N}_{1}|\)` is the number of observations in the training data.
- `\(\mathcal{N}_{0}\)` is the set of indices for the test data.
- `\(|\mathcal{N}_{0}|\)` is the number of observations in the test data.

---

# Example

- Suppose there are five observations, `\((y_1,\mathbf{x}_1),(y_2,\mathbf{x}_2),\ldots,(y_5,\mathbf{x}_5)\)`.
- Suppose observations 1, 2 and 4 are used as training data.
- Suppose observations 3 and 5 are used as test data.
- Then `\(\mathcal{N_1}=\left\{1,2,4\right\}\)` and `\(|\mathcal{N_1}|=3\)`.
- And `\(\mathcal{N_0}=\left\{3,5\right\}\)` and `\(|\mathcal{N_0}|=2\)`.
- Only the data in `\(\mathcal{N_1}\)` are used to determine `\(\hat{f}\)`.

---

# Training v Test

The *training* error rate is

`$$\frac{1}{|\mathcal{N}_{1}|}\sum\limits_{i\in\mathcal{N}_{1}} I(y_i\neq\hat{y}_i)$$`

The *test* error rate is

`$$\frac{1}{|\mathcal{N}_{0}|}\sum\limits_{i\in\mathcal{N}_{0}} I(y_i\neq\hat{y}_i)$$`

---

# Overfitting

- Some methods perform very well (even perfectly) on the training error rate.

--

- Usually these same methods perform poorly on the test error rate.

--

- This phenomenon is called *overfitting*.

--

- Generally, achieving a low **test** error rate (also called out-of-sample or generalisation error) is more important.

---

# A simple example

- Consider a test set of a single observation, `\(\mathcal{N_0}=\left\{j \right\}\)`.

--

- The classifier is trained using all data apart from observation `\(j\)`.

--

- This classifier is then used to predict the value of `\(y_j\)`.

--

- The choice of `\(j\)` may seem arbitrary.

---

# Cross validation

- The process can be repeated so that each observation is left out exactly once.

--

- Each time, all remaining observations are used as the training set.

--

- This process is called **leave-one-out cross validation (LOOCV)**.

---

# k-fold CV

- A faster alternative to LOOCV is **k-fold cross validation**.

--

- The data are randomly split into `\(k\)` partitions.

--

- Each observation appears in exactly one partition, i.e. the partitions are *non-overlapping*.

--

- Each partition is used as the test set exactly once.

---

# For regression

- In regression, rather than the misclassification rate, it may be better to look at the sum of squared errors.

--

- The concepts of training and test sets apply in the same way.

--

- Leave-one-out and k-fold cross validation can also be used in the same way.

---

# Next step

- Eventually we will introduce specific algorithms for doing classification.

--

- However, for all of these algorithms the distinction between training and test data is important.

--

- Equally, cross validation will be used throughout.

--

- Make sure you understand how these ideas work, separately from specific algorithms (the next slide sketches a train/test split in code).
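
---

# Train/test split in code

A minimal sketch of a training/test split and the resulting error rates, again assuming the `Default` data from the `ISLR` package; `glm()` simply stands in for whichever classifier is being assessed.

```r
library(ISLR)

set.seed(1)
n <- nrow(Default)

# N1: indices of the training sample (here 70% of the data); the rest form N0
train <- sample(n, size = round(0.7 * n))

# Fit f-hat using the training data only
fit <- glm(default ~ balance, data = Default[train, ], family = binomial)

# Training error rate
p_train <- predict(fit, type = "response")
mean(ifelse(p_train > 0.5, "Yes", "No") != Default$default[train])

# Test error rate: predictions for observations the classifier has not seen
p_test <- predict(fit, newdata = Default[-train, ], type = "response")
mean(ifelse(p_test > 0.5, "Yes", "No") != Default$default[-train])
```
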
---
class: inverse, middle, center

# Additional issues with prediction

---

# Predict v Explain

- In this unit our emphasis will be on *prediction*.

--

- This is very different to *explanation* or *causality*.

--

- Consider the example of predicting sales of Toyotas by looking at the number of internet searches for "Toyota".

--

- If a large number of people are searching for "Toyota", sales of Toyotas in the following period are likely to be higher.

---

# Causality

- This relationship is not easy to manipulate.

--

- For instance, if Toyota instructs its employees to spend the afternoon searching the word "Toyota" on Google, sales will not go up.

--

- In this case there is a *common cause* for browsing for cars and buying cars; namely, the intent to buy a car.

--

- Unlike intent to buy a car, browsing behaviour is observable and can be used for prediction.

---

# Two class v Multiclass

- Many classification problems involve a `\(y\)` variable that can take two values.
  - Default on credit card v Not Default.

--

- In other cases the `\(y\)` variable can take multiple values.
  - Brand choice, e.g. Gucci v Louis Vuitton v YSL v Givenchy.

--

- The methods we cover are general enough for both settings.

---

# Probabilistic Classification

- In many cases an algorithm will predict a single "best" class.
  - Predict that a customer will purchase Gucci.

--

- In other instances an algorithm will provide probabilities.
  - The customer has a 40% chance of purchasing Gucci, a 35% chance of purchasing Givenchy and a 25% chance of purchasing YSL.

---

# Probabilistic Classification

- A probabilistic prediction can be converted to a point prediction.

--

- Simply choose the class with the highest probability.

--

- In the example on the previous slide the choice would be Gucci.

---

# Two class case

- In the two class case, choosing the class with the highest probability is simple.

--

- Assign to a class if its probability is greater than 0.5.

--

- In some applications a different threshold may be used.

--

- This is particularly the case if there are asymmetric costs involved with different types of misclassification.

---

# An example

- Suppose you work for the tax office.

--

- You need to decide who should be audited and who should not be audited.

--

- When doing classification you can make two mistakes:
  - Audit an innocent person.
  - Fail to audit a guilty person.
- Are these mistakes equally costly?

---

# Tax example

- Auditing an innocent person is costly since resources are used for no gain.
  - Suppose it costs $100 to audit a person.
- Failing to audit a guilty person is costly since there is a failure to recover tax revenue.
  - Let $500 be the amount recovered from the guilty.
- In this example, it is more costly to fail to audit the guilty.
- However, the misclassification rate treats both errors the same.

---

# Sensitivity v Specificity

- In a 2-class problem, think of one class as the presence of a condition and the other class as the absence of a condition.

--

- In the auditing example the condition can be that the person is guilty.

--

- *Sensitivity* refers to the true positive rate: the proportion of the guilty classified as guilty.

--

- *Specificity* refers to the true negative rate: the proportion of the innocent classified as innocent.

---

# Sensitivity v Specificity

- Consider that we audit when the probability of being guilty is greater than 50%.

--

- Changing this threshold changes the sensitivity and specificity.

--

- Reducing the threshold to 0 means everyone is audited.
  The sensitivity will be perfect but the specificity will be zero.

--

- Raising the threshold to 1 means no one is audited.
  The specificity will be perfect but the sensitivity will be zero.

---

# Example

|Person  |Pred. Pr. Guilty |Truth      |
|--------|-----------------|-----------|
|A       |0.3              |Not Guilty |
|B       |0.4              |Guilty     |
|C       |0.6              |Guilty     |
|D       |0.7              |Guilty     |

---

# Questions

- For a threshold of 0.5:
  - What is your prediction for each individual?
  - What is the misclassification error?
  - What is the sensitivity?
  - What is the specificity?
  - What is the cost?

---

# Answer

|Person  |Pred. Pr. Guilty |Prediction |Truth      |
|--------|-----------------|-----------|-----------|
|A       |0.3              |Not Guilty |Not Guilty |
|B       |0.4              |Not Guilty |Guilty     |
|C       |0.6              |Guilty     |Guilty     |
|D       |0.7              |Guilty     |Guilty     |

---

# Answers

- Misclassification error is 0.25.
- Sensitivity is 0.6667.
- Specificity is 1.
- Cost is $500.

---

# Your turn

- How do the answers change when the threshold is 0.2?
- How do the answers change when the threshold is 0.65?
- (A code sketch at the end of these slides checks all of these numbers.)

---

# Answer (Threshold 0.2)

|Person  |Pred. Pr. Guilty |Prediction |Truth      |
|--------|-----------------|-----------|-----------|
|A       |0.3              |Guilty     |Not Guilty |
|B       |0.4              |Guilty     |Guilty     |
|C       |0.6              |Guilty     |Guilty     |
|D       |0.7              |Guilty     |Guilty     |

---

# Answers

- Misclassification error is 0.25.
- Sensitivity is 1.
- Specificity is 0.
- Cost is $100.

---

# Answer (Threshold 0.65)

|Person  |Pred. Pr. Guilty |Prediction |Truth      |
|--------|-----------------|-----------|-----------|
|A       |0.3              |Not Guilty |Not Guilty |
|B       |0.4              |Not Guilty |Guilty     |
|C       |0.6              |Not Guilty |Guilty     |
|D       |0.7              |Guilty     |Guilty     |

---

# Answers

- Misclassification error is 0.5.
- Sensitivity is 0.333.
- Specificity is 1.
- Cost is $1000.

---

# Conclusion

- For the remainder of the unit the focus is on different ways to do classification.

--

- In a business setting (and any other setting) be aware that:
  - Correlation does not imply causation.
  - Prediction should be thought about probabilistically.
  - Cost should be taken into account when classification is used in decision making.
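
---

# The audit example in code

A minimal sketch that checks the numbers above by varying the threshold and recomputing the error rate, sensitivity, specificity and cost. The probabilities, truths and costs are taken from the slides; the helper function `assess` is just for illustration.

```r
# Predicted probabilities of guilt and the true status of the four people
p_guilty <- c(A = 0.3, B = 0.4, C = 0.6, D = 0.7)
truth    <- c(A = "Not Guilty", B = "Guilty", C = "Guilty", D = "Guilty")

assess <- function(threshold) {
  # Audit (classify as guilty) when the predicted probability exceeds the threshold
  pred <- ifelse(p_guilty > threshold, "Guilty", "Not Guilty")
  c(error       = mean(pred != truth),
    sensitivity = mean(pred[truth == "Guilty"] == "Guilty"),
    specificity = mean(pred[truth == "Not Guilty"] == "Not Guilty"),
    # $100 per audited innocent person, $500 lost per unaudited guilty person
    cost        = 100 * sum(pred == "Guilty" & truth == "Not Guilty") +
                  500 * sum(pred == "Not Guilty" & truth == "Guilty"))
}

sapply(c(0.2, 0.5, 0.65), assess)   # one column of results per threshold
```
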