Decision Trees

Data Visualisation and Analytics

Anastasios Panagiotelis and Lauren Kennedy

Lecture 11

1

Trees

2

Credit Data

3

Default or not?

4

Default or not?

5

A tree

Predict default or no default for

  • 45 year old with a limit of 100000
  • 45 year old with a limit of 50000
  • 35 year old with a limit of 50000

Using the information on the next slide

6

Decision tree

7

Basic terminology

  • Every line is called an edge or a branch. They are connected to nodes.
  • At the nodes a rule determines which branch to take to the next node.
  • Nodes at the bottom are called terminal nodes or leaves.
  • At the leaves a decision is made on how to classify the variable.
  • Note that the tree is upside down!
8

Partitioning

If $x_j$ is a single predictor, then the rules that determine each decision have the following form:

If $x_j > c$ then go to one node; if $x_j \leq c$ then go to the other node.

This is called binary splitting

9

Partition

  • Each leaf corresponds to a box of values for the predictors.
  • Each of these boxes contains a few training observations.
  • Within each box find the most frequent class amongst these training observations.
  • This yields the predicted class for each box.
  • This will be clearer when we look at decision boundaries.
10

How to split?

  • The challenge is to find an $x_j$ and a $c$ to make each split.
  • One way is to define a criterion for the best split.
  • Then try all combinations of $x_j$ and $c$.
  • For a continuous $x_j$, simply rank the values of $x_j$ from smallest to largest.
  • Then consider the midpoints between consecutive values of $x_j$ as candidate cut points (see the sketch after this slide).
11
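
Below is a minimal sketch (not taken from the lecture code) of this exhaustive search for a single continuous predictor x. The impurity criterion is left as an argument because measures of a "good split" are only discussed on the following slides; all names here are illustrative.

#Candidate cut points are the midpoints between consecutive
#sorted values of x; keep the cut that minimises the weighted
#impurity of the two resulting partitions
best_split <- function(x, y, impurity) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  score <- sapply(cuts, function(cc) {
    left <- y[x <= cc]
    right <- y[x > cc]
    (length(left) * impurity(left) + length(right) * impurity(right)) / length(y)
  })
  cuts[which.min(score)]
}

In practice this search is repeated over every predictor, and the predictor and cut point with the best score are used for the split.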

Simple example first split

12

What is a good split?

  • There are multiple ways of thinking about a good split.
  • One is to think in terms of misclassification error.
  • Another set of measures aim for the partition to be pure.
  • By purity we mean that most observations within a partition belong to the same class.
13

Gini Impurity

  • Let $p_{mk}$ be the proportion of training observations in partition $m$ that belong to class $k$, and let $K$ be the total number of classes.

$$G = \sum_{k=1}^{K} p_{mk}(1 - p_{mk})$$

  • In the two-class case this is maximised when half of the observations are in each class.
  • It is minimised when all observations are in a single class (see the sketch after this slide).
14
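
A minimal sketch of this formula in R (assumed, not part of the lecture code): the class proportions within a partition are computed and plugged into the sum above.

#Gini impurity of a vector of class labels:
#G = sum over k of p_mk*(1 - p_mk), with p_mk the class proportions
gini_impurity <- function(y) {
  p <- table(y) / length(y)
  sum(p * (1 - p))
}
gini_impurity(c("D", "D", "N", "N")) #two classes, half each: 0.5 (the maximum)
gini_impurity(c("D", "D", "D", "D")) #a single class: 0 (the minimum)

This function could be passed as the impurity argument of the best_split sketch shown earlier.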

Perfect split

15

Worst split

16

Not best or worst

17

Exercise

  • In the last plot calculate the Gini impurity for both partitions.
  • For the bottom, Gini Impurity is 0
  • For the top, it is $\frac{1}{3}\cdot\frac{2}{3} + \frac{2}{3}\cdot\frac{1}{3} = \frac{4}{9}$ (verified in the sketch after this slide).
18
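
As a check, using the hedged gini_impurity sketch from earlier with illustrative labels that reproduce the 1/3 and 2/3 proportions:

gini_impurity(c("D", "N", "N")) #top partition: 1/3*2/3 + 2/3*1/3 = 4/9
gini_impurity(c("N", "N", "N")) #bottom partition: 0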

Stopping

19

When to stop

  • In principle a tree can be grown so that every training observation is in its own partition.
  • In this case the in-sample fit will be perfect.
  • This does not work well for out-of-sample prediction.
20

Control

  • The complexity of the tree can be controlled in a number of ways, as sketched in the code after this slide:
    • Set a maximum depth of tree.
    • Set a minimum number of training observations in each partition.
    • Only accept a split that improves the criterion by a fixed amount.
    • Pruning
21
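
A minimal sketch (assumed, not the lecture's code) of how these controls map to arguments of rpart.control in the rpart package used later in these slides; the values below are purely illustrative.

library(rpart)
#maxdepth: maximum depth of the tree
#minbucket: minimum number of training observations in a leaf
#cp: a split must improve the criterion by at least this amount
ctrl <- rpart.control(maxdepth = 4, minbucket = 7, cp = 0.01)
#fit <- rpart(y ~ x1 + x2, data = train, control = ctrl) #placeholder names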

Pruning

  • The idea behind pruning is to start with a large tree.
  • A new objective function is defined that penalises larger trees.
  • Consider different subtrees of the large tree that optimise the penalised objective.
  • There is a tuning parameter that can be set by cross-validation (CV), as in the sketch after this slide.
22
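
A minimal sketch of cost-complexity pruning with rpart (assumed, not shown in the lecture): grow a large tree, inspect the cross-validated error for each value of the complexity parameter cp, and prune back to the best one. rpart_big is the large tree fitted later in these slides.

printcp(rpart_big) #CV error for each cp value
best_cp <- rpart_big$cptable[which.min(rpart_big$cptable[, "xerror"]), "CP"]
rpart_pruned <- prune(rpart_big, cp = best_cp)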

Decision Boundary

23

Data again

24

Decision Boundaries

  • For the next few slides the same grid of points that was used for kNN and DA is presented again.
  • If the tree predicts "Default" the grid point is coloured black.
  • If the tree predicts "No default" the grid point is coloured yellow.
  • Think about how these decision boundaries compare to kNN and DA (a sketch of how such a grid can be built follows this slide).
25
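
The lecture's plotting code is not shown; below is a minimal sketch of how such a grid of predictions can be built, using the mpg tree (rpart_small) fitted later in these slides and illustrative ranges.

library(tidyverse)
#Grid of points covering the predictor space
grid <- expand.grid(displ = seq(1.5, 7, by = 0.1),
                    hwy = seq(12, 45, by = 0.5))
#Predicted class at every grid point
grid$pred <- predict(rpart_small, grid, type = 'class')
#Colour each grid point by its predicted class
ggplot(grid, aes(x = displ, y = hwy, colour = pred)) + geom_point()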

Decision Boundary: Biggest Tree

26

Decision Tree: Biggest Tree

27

Controls

  • This tree is too complicated and will overfit the data.
  • Build a smaller tree using the defaults of the R function.
  • There must be at least 7 observations in a partition.
  • If a split does not improve Gini Impurity by more than 0.01 the algorithm stops.
28

Decision Boundary: Smaller Tree

29

Multiclass Example

  • On the next few slides we will consider a different dataset where the target variable can take three values.
  • Using the mpg dataset we will predict whether a car is a 4WD, a rear-wheel drive or a front-wheel drive.
  • The predictors will be fuel efficiency on the highway (hwy) and engine size (displ).
30

Multiclass Example: Data

31

Multiclass Example: Tree

32

Decision Tree: Multiclass

33

Stability of Trees

34

Variability

  • Compared to LDA and QDA, trees have higher variability across different training data sets.
  • This can be mitigated by choosing smaller trees.
  • The next few slides show the decision boundaries of trees for different subsamples of training data.
35

Train Sample 1: Big Tree

36

Train Sample 2: Big Tree

37

Train Sample 1: Small Tree

38

Train Sample 2: Small Tree

39

Trees in R

40

Package

  • Classification trees can be implemented using the rpart package.
  • We will demonstrate using the mpg data.
  • As last week, split the data into training and test sets.
  • Fit a decision tree on the training data and form predictions for the test data.
41

Split Sample

#Load the tidyverse (provides mpg, add_column and filter)
library(tidyverse)
#Find total number of observations
n<-NROW(mpg)
#Create a vector allocating each observation to train
#or test
train_or_test<-ifelse(runif(n)<0.7,'Train','Test')
#Add to mpg data frame
mpg_exp<-add_column(mpg,Sample=train_or_test)
#Isolate Training Data
mpg_train<-filter(mpg_exp,Sample=='Train')
#Isolate Test Data
mpg_test<-filter(mpg_exp,Sample=='Test')
42

Tree

#Load rpart for fitting classification trees
library(rpart)
#Default Settings
rpart_small<-rpart(drv~displ+hwy,data = mpg_train)
#Bigger tree
#Allow for partitions with as few as two
#training observations
#Accept any split that improves fit
rpart_big<-rpart(drv~displ+hwy,data = mpg_train,
control = rpart.control(minbucket=2,
cp=0))
#Make predictions
pred_small<-predict(rpart_small,mpg_test,type='class')
pred_big<-predict(rpart_big,mpg_test,type='class')
43

Misclassification Rate

To compute test misclassification

mean(pred_small!=mpg_test$drv)
## [1] 0.15625
mean(pred_big!=mpg_test$drv)
## [1] 0.140625

In this case the bigger tree performs better out-of-sample (this is rare).

44
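
Beyond the overall rate, a confusion matrix (not shown in the slides) indicates which classes are being confused; a quick sketch with base R:

#Cross-tabulate the true drive type against the small tree's predictions
table(truth = mpg_test$drv, predicted = pred_small)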

Probabilities

  • By leaving out the type='class' option, probabilities are returned.
  • These probabilities are simply the proportions of the classes within each partition (a sketch after the output below shows how they map back to class predictions).
pred_small<-predict(rpart_small,mpg_test)
pred_small
## 4 f r
## 1 0.0000000 1.00000000 0.00000000
## 2 0.1315789 0.81578947 0.05263158
## 3 0.1315789 0.81578947 0.05263158
## 4 0.1315789 0.81578947 0.05263158
## 5 0.1315789 0.81578947 0.05263158
## 6 0.0000000 0.00000000 1.00000000
## 7 0.0000000 1.00000000 0.00000000
## 8 0.0000000 1.00000000 0.00000000
## 9 0.1315789 0.81578947 0.05263158
## 10 0.9743590 0.02564103 0.00000000
## 11 0.9743590 0.02564103 0.00000000
## 12 0.9743590 0.02564103 0.00000000
## 13 0.9743590 0.02564103 0.00000000
## 14 0.9743590 0.02564103 0.00000000
## 15 0.9743590 0.02564103 0.00000000
## 16 0.7777778 0.00000000 0.22222222
## 17 0.9743590 0.02564103 0.00000000
## 18 0.9743590 0.02564103 0.00000000
## 19 0.9743590 0.02564103 0.00000000
## 20 0.9743590 0.02564103 0.00000000
## 21 0.9743590 0.02564103 0.00000000
## 22 0.3333333 0.00000000 0.66666667
## 23 0.9743590 0.02564103 0.00000000
## 24 0.9743590 0.02564103 0.00000000
## 25 0.9743590 0.02564103 0.00000000
## 26 0.1315789 0.81578947 0.05263158
## 27 0.1315789 0.81578947 0.05263158
## 28 0.0000000 0.00000000 1.00000000
## 29 0.0000000 1.00000000 0.00000000
## 30 0.0000000 1.00000000 0.00000000
## 31 0.3125000 0.68750000 0.00000000
## 32 0.8571429 0.14285714 0.00000000
## 33 0.1315789 0.81578947 0.05263158
## 34 0.3125000 0.68750000 0.00000000
## 35 0.3125000 0.68750000 0.00000000
## 36 0.1315789 0.81578947 0.05263158
## 37 0.9743590 0.02564103 0.00000000
## 38 0.7777778 0.00000000 0.22222222
## 39 0.9743590 0.02564103 0.00000000
## 40 0.9743590 0.02564103 0.00000000
## 41 0.9743590 0.02564103 0.00000000
## 42 0.9743590 0.02564103 0.00000000
## 43 0.0000000 1.00000000 0.00000000
## 44 0.1315789 0.81578947 0.05263158
## 45 0.9743590 0.02564103 0.00000000
## 46 0.0000000 0.00000000 1.00000000
## 47 0.3125000 0.68750000 0.00000000
## 48 0.9743590 0.02564103 0.00000000
## 49 0.9743590 0.02564103 0.00000000
## 50 0.9743590 0.02564103 0.00000000
## 51 0.0000000 1.00000000 0.00000000
## 52 0.0000000 1.00000000 0.00000000
## 53 0.1315789 0.81578947 0.05263158
## 54 0.1315789 0.81578947 0.05263158
## 55 0.0000000 1.00000000 0.00000000
## 56 0.0000000 1.00000000 0.00000000
## 57 0.7777778 0.00000000 0.22222222
## 58 0.9743590 0.02564103 0.00000000
## 59 0.9743590 0.02564103 0.00000000
## 60 0.1315789 0.81578947 0.05263158
## 61 0.0000000 1.00000000 0.00000000
## 62 0.0000000 1.00000000 0.00000000
## 63 0.1315789 0.81578947 0.05263158
## 64 0.0000000 1.00000000 0.00000000
45
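
The class predictions shown earlier simply correspond to the column with the highest probability in each row; a small sketch with base R (ties, if any, are broken at random by max.col):

#Pick the most probable class for each test observation;
#this reproduces predict(..., type = 'class')
colnames(pred_small)[max.col(pred_small)]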

Plotting the trees

  • The package rpart.plot is good for plotting the trees themselves.
  • The trees we have plotted so far use code such as that below.
library(rpart.plot)
rpart.plot(rpart_small,extra = 0,type = 0)
rpart.plot(rpart_big,extra = 0,type = 0)
46

Small Tree

47

Big Tree

48

More detail

  • By default, the function rpart.plot actually provides more information. Try the following
rpart.plot(rpart_small)
  • This provides:
    • The most frequent class at each node
    • The proportion of each class at each node
    • The proportion of the data in each node
49

Small Tree

50

Regression Trees

  • The same ideas can be applied to regression.
  • The prediction is the average value of the dependent variable for all training observations in the same partition.
  • Splits can be chosen to minimise the sum of squared errors rather than the Gini impurity (see the sketch after this slide).
51
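
A minimal sketch of a regression tree with rpart (assumed, not part of the lecture code), predicting highway fuel efficiency from engine size in the mpg training data:

#method = 'anova' fits a regression tree: splits minimise the sum of
#squared errors and each leaf predicts the mean of its training observations
rpart_reg <- rpart(hwy~displ, data = mpg_train, method = 'anova')
pred_reg <- predict(rpart_reg, mpg_test) #predicted hwy for the test data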

Ensemble methods

  • Due to their high variability, ensemble methods are often used together with trees.
  • One way to do this is to resample the data many times and fit a new tree each time (bagging).
  • When the number of predictors is large, the predictors can also be randomly sampled (random forest).
  • A prediction is obtained from each tree, and the overall prediction is the most frequent class across trees (see the sketch after this slide).
52
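
A minimal sketch using the randomForest package (an assumption; this package is not used in the lecture): many trees are grown on bootstrap resamples of the training data, a random subset of predictors is considered at each split, and the prediction is a majority vote across trees.

library(randomForest)
set.seed(1)
rf <- randomForest(factor(drv)~displ+hwy, data = mpg_train, ntree = 500)
pred_rf <- predict(rf, mpg_test) #majority vote across the 500 trees
mean(pred_rf != mpg_test$drv) #test misclassification rate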

Conclusion

  • Classification trees are an intuitive and interpretable way to let data guide decision making.
  • In practice much care must be taken to prevent overfitting.
  • Even so, trees are very sensitive to small changes in training data.
  • As such, trees are usually combined in more sophisticated ensemble learning methods.
  • This does come at the cost of interpretability.
53
