Trees

This note dives into two methods that use binary trees to build models for predicting future data. "The key difference between classification and regression trees is that in classification the dependent variables are categorical and unordered while in regression the dependent variables are continuous or ordered whole values."

[Figure: continuous 3-dimensional function (left) with a step-wise estimation (right)]

Advantages of trees

  • simplicity and ease of communication
  • compact representation
  • speed of computation (easy to parallelize)
  • easy to handle both continuous and categorical variables (and mixtures of them)
  • easy to implement different loss functions
  • easy to handle missing data
  • automatic variable selection

Disadvantages of trees

  • instability of the results (can be mitigated by ensemble methods:)
    • bagging
    • boosting
    • random forests
  • difficulties with dynamic data (difficult to integrate new information)
  • hard to approximate steep functions
  • no formal statistical procedures
    • hypothesis testing
    • confidence intervals
    • ...
  • difficult to evaluate variable importance

Regression

Regression trees are piecewise constant functions made up of local averages. The idea is to split the data into intervals along the $x$-axis, grouping data with as little variation along the $y$-axis as possible. The average of these areas can then give an approximation for data points that fall into this interval.

The concept can be expanded to higher dimensions, as seen in the figure above. We'll start, however, by looking at a 2-dimensional example, along an $x$- and $y$-axis.

However, before we split up the data, we need to ask ourselves three questions.

  1. How many intervals do we need to split our $x$-axis into?
  2. Where exactly should these splits occur?
  3. Which value of $y$ must be assigned to each interval?

The answer to the last question is found with basic calculus. For an interval $R_n$ with length $|R_n|$, $f(x)$ can be estimated from:

$$\frac{1}{|R_n|}\int_{R_n} f(x) dx$$

This can be thought of as adding up the $f(x)$ values across the whole interval and dividing by its length, i.e. taking the average of $f$ over the interval. For observed data over an interval $[a,b]$, the analogous estimate is simply the average of the $y_i$ whose $x_i$ fall in that interval, where $n_{[a,b]}$ denotes the number of such observations:

$$\hat y = \frac{1}{n_{[a,b]}}\sum_{x_i \in [a,b]} y_i$$
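As a concrete illustration, here is a minimal R sketch, with made-up data and breakpoints, that computes these interval averages and uses them as a step-wise estimate:

# toy data, purely for illustration
set.seed(1)
x.toy <- sort(runif(100, 0, 3))
y.toy <- sin(2 * x.toy) + rnorm(100, sd = 0.2)

# assumed breakpoints defining the intervals along the x-axis
interval <- cut(x.toy, breaks = c(0, 1, 2, 3), include.lowest = TRUE)

# the fitted value for each interval is simply the average of its y-values
y.hat <- tapply(y.toy, interval, mean)

plot(x.toy, y.toy)
lines(x.toy, y.hat[interval], type = "s", col = 2)   # step-wise estimate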

The answer to the second question lies in the "slope" of the function we're trying to predict. Large jumps in $y$-values as $x$ increases will lead to large residual errors. This means that areas with high variation in $y$-values should be given smaller intervals along the $x$-axis, and areas with low variation in $y$-values should be given larger intervals along the $x$-axis.

The number of intervals we include in our model depends on how complex a model we would like to build. As a rule of thumb, an interval shouldn't be given fewer than 20 data points, although exceptions can be made for very small data sets with high variance.

By following the same logic as binary trees, regression trees can be represented by directed flow charts.

image.png

Notice how the possible answers at each node give a different area on the plot.

These flow charts may not be as visually appealing as the plot of the step-wise estimation shown at the beginning of this section, but they are very cheap to store and compute, since they consist only of nodes and leaves. The model can also be refined very easily by just adding another node.

Growing a tree

As mentioned above, the more nodes we give our regression tree, the more complex our model will be. Growing a regression tree is done recursively, by comparing the residual error of a tree with $n$ nodes against the residual error of a tree with $n-1$ nodes.

We start out by giving the tree one node, $p=1$. How this node is chosen will be described a little later. The residual errors for each side of the split are found in three steps:

  1. Find the average of the data points in that interval.
  2. Find and square the residual differences (i.e. subtract the average from each actual data point and square the result).
  3. Add these squared differences together.

Adding the errors from both sides of the split gives the total residual error for this split.

The two new intervals can now also be split, comparing their respective residual errors to a new one generated from the new split. As long as the errors keep decreasing, you can continue splitting and comparing, until each leaf contains only one data point (complete interpolation). This can be avoided by pruning.
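As a rough illustration of a single greedy step (this is a sketch, not how the tree package actually implements it), the best split point can be found by trying every midpoint between consecutive $x$-values and keeping the one that minimizes the total squared error:

# one greedy split: try every midpoint between consecutive x-values
# and keep the one minimizing the total residual sum of squares
best.split <- function(x, y) {
  xs <- sort(unique(x))
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between consecutive x-values
  rss <- sapply(candidates, function(s) {
    left  <- y[x <= s]
    right <- y[x >  s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = candidates[which.min(rss)], rss = min(rss))
}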

It is worth noting that this greedy method may not find the global minimum; it is, however, a simple (low-complexity) yet acceptable tool for local analysis of some covariates.

Pruning a tree

By solely minimizing the residual error, as done above, we will eventually begin over-fitting our data. To counter this, we add a penalty term, $\alpha J$, to the model which penalizes complexity. That way, at some point, the penalty from adding yet another node outweighs the improvement in fit, which is exactly where the model would otherwise tip from an adequate fit into an over-fit. Mathematically, this is written as:

$$ C_{\alpha}(J) = \sum _{n=1}^J D_n + \alpha J $$

where $D_n$ is the $n$-th of $J$ residual errors explained above, and $\alpha$ is our tuning parameter.

As $J\rightarrow \infty$, i.e. as more and more nodes are added to the tree, $\sum_{n=1}^J D_n \rightarrow 0$ while $C_{\alpha}(J) \rightarrow \infty$, assuming $\alpha > 0$: the penalty eventually dominates.
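To make the trade-off concrete, here is a small hypothetical calculation (the deviance values below are made up) showing how the penalized criterion picks a tree size:

D.total <- c(10, 4, 2, 1.2, 0.8, 0.6)   # assumed total residual errors for trees with 1..6 leaves
alpha   <- 1.5                          # assumed tuning parameter
J       <- seq_along(D.total)
C.alpha <- D.total + alpha * J          # C_alpha(J) = residual error + complexity penalty
J[which.min(C.alpha)]                   # the size that balances fit and complexity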

image.png

We can use cross-validation to find an acceptable $\alpha$. This process:

  1. Splits the data into $k$-folds.
  2. Builds $J$ different regression trees, with $1$ to $J$ nodes, respectively, using the data from folds $1,2,...,k-1$
  3. Finds each tree's residual error by comparing the model to the $k$-th fold of data.
  4. Stores the number of nodes giving the lowest deviance. Remember that the over-fit model will not perform well on the completely new data from the $k$-th fold.
  5. Steps 2,3, & 4 are repeated $k-1$ more times, each time changing only which fold of data to exclude from training and use for testing.
  6. The average number of nodes, along with the average lowest deviance, is used to calculate $\alpha$ as the slope of the line between this point and the origin (red line in the figure above).

Once we have $\alpha$, we will know how complex our model needs to be in order to tell us as much as possible, while retaining reliability. In other words, we now know how many nodes we should include in our regression tree.
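The tree library used in the walkthrough below provides cv.tree for exactly this; here is a self-contained sketch on simulated data (the data and control parameters are made up for illustration):

library(tree)

# simulated data, for illustration only
set.seed(1)
d   <- data.frame(x = runif(200, 0, 3))
d$y <- sin(2 * d$x) + rnorm(200, sd = 0.2)

fit <- tree(y ~ x, data = d,
            control = tree.control(nobs = nrow(d), mindev = 0.001, minsize = 2))
cv  <- cv.tree(fit)                       # k-fold cross-validation over the pruning sequence
best.size <- cv$size[which.min(cv$dev)]   # number of leaves with the lowest CV deviance
pruned <- prune.tree(fit, best = best.size)
plot(pruned); text(pruned)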

Regression Trees in R

Let's walk through an example of regression tree analysis using R. We will be using an example dataset from http://azzalini.stat.unipd.it/Book-DM/yesterday.dat.

We start by reading in the data. The response values of our data set are split into two parts: yesterday's data and tomorrow's data. This division will come in handy when comparing our models built on yesterday's data with new data from tomorrow.

In [1]:
dataset <- read.table("http://azzalini.stat.unipd.it/Book-DM/yesterday.dat", header = TRUE)
x <- rep(dataset$x,2)
y <- c(dataset$y.yesterday, dataset$y.tomorrow)


# figure 4.19
plot(x, y, type = "n")
points(dataset$x, dataset$y.yesterday, pch = 16, col = 2)
points(dataset$x, dataset$y.tomorrow, pch = 17, col = 3)
legend('bottomright', legend = c("Yesterday", "Tomorrow"), pch = 16:17, col = 2:3)

As we can see in the plot, the two data sets seem to follow the same trend, which means we should be able to build a model for them.

The next step is to actually build the tree from yesterday's data. We make use of the R library tree to create this estimation.

In [45]:
# install.packages('tree')
library('tree')

x0 <- dataset$x
y0 <- dataset$y.yesterday
t3 <- tree(y0 ~ x0, control = tree.control(nobs = 30, mindev = 0.001, minsize = 2))
# figure 4.21a
plot(t3)
text(t3)

As you can see from the tree above, the tree function returns a full regression tree, where every leaf node represents a single observation. This is an example of over-fitting.

Let's now try pruning our tree using data from tomorrow to see which nodes should be kept, and which can be neglected. This can be done using prune.tree from the tree library, setting newdata to tomorrow's data. The function evaluates the sequence of pruned subtrees on this new data and returns an object mapping each tree size to its deviance.

In [5]:
par(mar = c(3.5, 3.5, 2, 1) + 0.1)
p3 <- prune.tree(t3, newdata = data.frame(x0 = x0, y0 = dataset$y.tomorrow))
# figure 4.21b
plot(p3)

The decrease in deviance as size increases shows our model getting better. Eventually, the deviance begins to increase again; this marks the balance point between complexity and fit.

Since the lowest recorded deviance is first achieved at size 4, we can tell prune.tree to use this as the "best" parameter. The function then returns the subtree with 4 leaves, i.e. the data split into 4 intervals, each with its own fitted value.

In [18]:
p4 <- prune.tree(t3, best = 4)
# figure 4.21c
plot(p4)
text(p4)

This model now represents the optimal regression tree for these data, with the correct pruning. As we can see from the graph above, the 4 intervals are fairly balanced, with the data split almost 50-50 at each node.

In [20]:
# figure 4.21d
x0 <- seq(0.5, 3, length = 5000)
plot(x, y, type = "n")
points(dataset$x, dataset$y.yesterday, pch = 16, col = 2)
points(dataset$x, dataset$y.tomorrow, pch = 17, col = 3)
legend(2, 0.45, legend = c("Yesterday", "Tomorrow"), pch = 16:17, col = 2:3)
lines(x0, predict(p4, newdata = data.frame(x0 = x0)), col = 4)

Our model seems to fit our data well, while remaining incredibly simple.
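As a quick sanity check (not part of the book's code), we can compute the pruned tree's mean squared error against tomorrow's responses, which are recorded at the same x values:

# mean squared error of the 4-leaf tree on tomorrow's data
mean((predict(p4, newdata = data.frame(x0 = dataset$x)) - dataset$y.tomorrow)^2)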

Classification

Classification trees are a lot like regression trees, except that they work on qualitative (categorical) data, whereas regression trees work on quantitative data.

With regression trees, we broke our data into intervals using a series of boolean conditions on the covariates. The result was a binary tree with local averages at the leaf nodes, each depicting one step of that region's step function.

Classification trees follow a similar structure, but get the values of their leaf nodes in a different manner. Whereas the values at the leaf nodes of a regression tree come directly from the response values in the corresponding intervals, the leaf nodes of a classification tree each represent one of the states in the discrete state space, $K$, making up the response domain. This is just fancy talk for saying that the categories of the response variable are found at the leaf nodes of a classification tree. For each leaf node, an estimate is also produced of how confident we can be that data points falling into that region belong to its category.

This means that we are no longer interested in estimating a continuous function $f(x)$, since with qualitative data such a function doesn't make sense. We are instead interested in the long term probabilities, $\pi_k$. These tell us the probability of a new data point having response value $y=k$.

Long term probabilities

The optimal classification tree breaks up all future data sets into $J$ partitions, or regions, where each partition has the highest probability of correctly guessing the response values for the data. In order to do so, we need to find the long term probability that a response variable will be in a specific state.

Let's start simple, by using a binary response variable, with possible values 0 and 1. We can then define $\pi_1$ as the long term probability of getting a response variable equal to 1:

$$\pi_1(x)=P(Y=1 | X=x) $$

The exact values of $\pi_k$ are hard to find, since you would need an infinite number of observations. We can instead estimate $\pi_k$ from our data by computing, for each of our partitions, the proportion of observations in state $k=1$, denoted $\hat p_j$ below:

$$ \hat\pi_{1} (x) = \sum_{j=1}^{J} \hat p_j \, I(x \in R_j) $$

Here $I(x \in R_j)$ is the indicator function: it equals 1 if $x$ falls in region $R_j$ and 0 otherwise. The sum over $j = 1, \dots, J$ runs over all the partitions we've split our data into, so for a given $x$ only the region containing it contributes, and $\hat\pi_{1}(x)$ is simply the local probability $\hat p_j$ of that region.

The local probability $p_j$ can be estimated from our data set by counting how many observations in region $R_j$ actually have response value equal to $1$. The indicator $I(y_i=1)$ equals 1 exactly for those observations, so the sum counts them, and dividing by $n_j$ (the number of observations in $R_j$) gives their proportion.

$$\hat p_j= \frac{1}{n_j} \sum_{i\in R_j} I(y_i=1)$$

So far we've found the long term probabilities of our different states, $\pi_k$, along with the local probabilities $p_j$ of a new observation having response value $y_{new}=1$. We can make this slightly more general by letting $p_{j,k}$ denote the probability that data in region $R_j$ has response value $y=k$.
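To make this concrete, here is a minimal sketch (with simulated regions and a binary response, both made up for illustration) of how $\hat p_j$ and $\hat p_{j,k}$ can be estimated in R:

# toy data: an assumed region label and a binary response for each observation
set.seed(1)
region <- factor(sample(1:3, 60, replace = TRUE))           # regions R_1, R_2, R_3
y.bin  <- rbinom(60, 1, prob = c(0.2, 0.5, 0.8)[region])    # response in {0, 1}

n.j <- table(region)                 # n_j: number of observations in each region
p.j <- tapply(y.bin, region, mean)   # hat p_j: proportion of y = 1 in each region
p.j

# with more than two classes, hat p_{j,k} is the row-wise proportion table:
# prop.table(table(region, y.bin), 1)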

We're soon ready to compare our model to models developed with other methods; we're just missing our comparison metric.

Deviance for binomial example

We would now like to measure how accurate our model is for new data. This can be done by computing the model's deviance. Just like with regression trees, a higher deviance means a less accurate model.

Since the state space in our example contains only 2 possible states $K = \{k_0=0, k_1=1\}$, we can use the deviance of the binomial distribution as our model's deviance function. (For larger state spaces, other metrics can help find a model's error. Some of these are mentioned later.)

$$ D_{binomial} = -2 \sum _{i=1}^n \{y_i \log(\hat \pi_1(x_i)) + (1- y_i)\log(1-\hat \pi_1(x_i))\} $$

Here, each observation contributes the log of the estimated probability of the state it actually belongs to. For every observation,

  • if $y_i=0$ then $0 \cdot \log(\hat\pi_1) + 1\cdot \log(1 - \hat\pi_1) = \log(\hat\pi_0)$ is added to the sum

  • if $y_i=1$ then $1\cdot \log(\hat\pi_1) + 0\cdot \log(1-\hat\pi_1) = \log(\hat\pi_1)$ is added to the sum

Given that we defined $p_j$ to be the probability of $y_i=1$ for region $R_j$, we can rewrite this deviance function as:

$$ D = -2 \sum_{j=1}^J n_j[\hat p_j \log(\hat p_j) + (1 - \hat p_j)\log(1-\hat p_j)] = \sum_{j=1}^J D_j $$

Let's break this down a bit. For each region, we take the probability of getting $y=1$, $\hat p_j$, times its log, $\log(\hat p_j)$, add the complementary probability $(1-\hat p_j)$ times its log, $\log(1-\hat p_j)$, and multiply the result by the number of observations, $n_j$, in that region. These region-wise terms are then added together over all the regions in our model.
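Continuing the sketch above (using its p.j and n.j), the region-wise deviances $D_j$ and the total deviance $D$ can be computed as:

# region-wise deviances D_j; pure regions (p.j equal to 0 or 1) contribute 0,
# using the convention 0*log(0) = 0
D.j <- -2 * n.j * ifelse(p.j %in% c(0, 1), 0,
                         p.j * log(p.j) + (1 - p.j) * log(1 - p.j))
D   <- sum(D.j)   # the model's total deviance
D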

image.png

Here, we can also point out that the log of any number between 0 and 1 is negative, and both terms, $\hat p_j\log(\hat p_j)$ and $(1-\hat p_j)\log(1-\hat p_j)$, are of this form. The minus sign in front of the summation is what makes the deviance positive (the factor 2 comes from the usual definition of the deviance in terms of $-2$ times a log-likelihood).

Moreover, the closer a number is to 1, the closer its log gets to 0. Since we are working with probabilities, these are the only values of $\log(x)$ we are interested in.

Multiplying the log of a probability by the probability itself gives an interesting metric. Because $\log(x)$ is not a linear function, it does not change at the same rate as $x$. As a result, products involving probabilities close to 0 or 1 are smaller in absolute value than those involving probabilities closer to 0.5.

This is precisely why the $p\log(p)$ form is used as the metric. As shown in the simple R code below, as $p_j$ moves towards 0 or 1, the absolute value of the corresponding deviance term gets smaller and smaller.

In [42]:
x.log <- seq(1, 9) * 0.1
y.log <- abs(x.log * log(x.log) + (1 - x.log) * log(1 - x.log))

x.log
round(y.log, 1)

plot(x.log, y.log, main = 'Deviance using p*log(p)', type = 'b')
x.log: 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
round(y.log, 1): 0.3 0.5 0.6 0.7 0.7 0.7 0.6 0.5 0.3

Applying this logic to our equation above tells us that regions with balanced probabilities, $p_j=0.5$, give a high deviance, while regions with very small or very large $p_j$ give a lower deviance.

Further, the model's total deviance is simply the sum of the deviances of all its regions, as the last equality above shows.

More general error indexes

There are different indexes of misclassification error we can use instead of the deviance of the binomial distribution. This is handy when the response variable has more than 2 possible states. In order to use these other metrics, we must first rewrite the above equation once more, this time in terms of a general impurity function, $Q(\hat p_j)$.

$$ D = 2n \sum _{j=1}^J \frac{n_j}{n} Q(\hat p_j) $$

Entropy gives us a measure of the impurity of a region, i.e. it tells us how non-homogeneous the data within a region is. $Q=0$ says the region is 100% pure, while $Q$ reaches its maximum when the region is split 50%-50%. Notice this matches what we saw for the binomial deviance above: perfectly balanced probabilities between the states of a region give the highest deviance.

The entropy version of $Q(\hat p_j)$ used above can be written generally, over all states $k \in K$, as:

$$ Q(\hat p_j) = -\sum _{k \in K} \hat p_{jk} \log(\hat p_{jk}) $$

Another misclassification index we can use for $Q(p)$ instead is called the Gini-index, and is written:

$$ Q(\hat p_j) = \sum _{k \in K} \hat p_{jk} (1-\hat p_{jk}) $$

Here, instead of multiplying by the log, we multiply each probability by its complement, $(1-\hat p_{jk})$, for every state and every region.
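Both impurity measures generalize directly to more than two classes. Here is a minimal sketch (the class proportions are made up) computing the entropy and Gini impurity for a single region:

# impurity of one region from its class proportions p_jk (assumed to sum to 1)
entropy <- function(p) -sum(ifelse(p == 0, 0, p * log(p)))   # treats 0*log(0) as 0
gini    <- function(p)  sum(p * (1 - p))

p.jk <- c(0.6, 0.3, 0.1)   # assumed class proportions for one region R_j
entropy(p.jk)
gini(p.jk)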

In [43]:
x.lin <- (1:9) * 0.1
y.lin <- x.lin * (1 - x.lin)

plot(x.lin, y.lin, main='Deviance using p(1-p)',type='b')

As we can see, this gives the same relative distribution of deviance as using our binary entropy indicator above.

Classification Trees in R

Before we start looking at classification trees in R, note that we will also make use of some helper functions. These are pulled from a program developed for the lecture presenting these topics, and they are listed in full at the end of this note.


Let's now build a classification tree model using R. We will use an on-line data set about juices. Each observation contains a response variable choice, whose value is either 'CH' or 'MM'.

We start by reading in our data:

In [70]:
juice <- read.table("http://azzalini.stat.unipd.it/Book-DM/juice.data", header = TRUE) 
juice[ , "store"] <- factor(juice[ , "store"])
v <- c(1, 3, 6:9, 13, 21)
juice <- juice[ , v] 
head(juice)
A data.frame: 6 × 8

  choice week priceCH priceMM discountCH discountMM loyaltyMM store
1     CH  237    1.75    1.99       0.00        0.0  0.500000     1
2     CH  239    1.75    1.99       0.00        0.3  0.400000     1
3     CH  245    1.86    2.09       0.17        0.0  0.320000     1
4     MM  227    1.69    1.69       0.00        0.0  0.600000     1
5     CH  228    1.69    1.69       0.00        0.0  0.043465     0
6     CH  230    1.69    1.99       0.00        0.0  0.034772     0

Our next step will be to split our data into a training and a test set, for testing our model.

In [71]:
#--select training and test
set.seed(123)
n <- nrow(juice)
n1 <- round(n * 0.75)
n2 <- n - n1
permutation <- sample(1:n,n)
training <- sort(permutation[1:n1])
juice1 <- juice[training, ]
juice2 <- juice[-training, ]

We can now fit our model, again making use of the tree library.

In [72]:
set.seed(123)
part1 <- sort(sample(training, 600))
part2 <- setdiff(training, part1)
f1 <- as.formula(paste("choice ~ ", paste(names(juice1)[-1], collapse = "+"),
                        collapse = NULL))
t1<- tree(f1, data = juice[part1, ], control = tree.control(nobs = length(part1),
                                                            minsize = 2, mindev = 0))

t1 now represents an extremely over-fit tree, just like the one we grew for regression. The first two lines split the training set into part1 (a random subsample of 600 observations used to grow the tree) and part2 (the remaining training observations, held back for pruning), while f1 builds the formula choice ~ all remaining covariates.

We can now try pruning our tree, to see which complexity gives the lowest deviance.

In [73]:
t2<- prune.tree(t1, newdata = juice[part2, ])
# figure 5.20a
plot(t2)

As you can see, the pruning sequence has quite a lot of steps. Let's find, automatically, the size J that gives the lowest deviance.

In [74]:
J <- t2$size[t2$dev == min(t2$dev)]
J

t3 <- prune.tree(t1, best = J)
# figure 5.20b
plot(t3)
text(t3)
5
In [76]:
# More details of our model

p3 <- predict(t3, newdata = juice2, type = "class")
# table 5.11
confusion.matrix(p3, juice2[ , "choice"])
#
p3 <-  predict(t3, newdata = juice2, type = "vector")[ , 2]
a <- lift.roc(p3, as.numeric(juice2[ , "choice"] == "MM"), type = "bin", plot.it = FALSE)
# figure 5.21a
plot(a[[1]], a[[2]], type = "b", xlab = "Fraction of predicted subjects",
     ylab = "Improvement factor", col = 2, pch = 16, cex = 2)
# figure 5.21b
plot(a[[3]], a[[4]], type = "b", xlim = c(0, 1), pch = 16,
     ylim = c(0, 1), cex = 2, xlab = "1-specificity",
     ylab = "Sensibility", col = 2)
         observed
predicted  CH  MM
       CH 122  17
       MM  42  87
total error:  0.2201493 
false positives/negatives:  0.2560976 0.1634615 

Let's look at one more example before we finish. This code was taken from the same website as the data.

In [77]:
dataset <- read.table("http://azzalini.stat.unipd.it/Book-DM/classes.dat", header = TRUE)
x1 <- dataset[ , 1]
x2 <- dataset[ , 2]
gr <- as.factor(dataset[ , 3])
K <- 3
t3 <- tree(gr ~ x1 + x2, control = tree.control(nobs = nrow(dataset), mindev = 0, minsize = 2))
# figure 5.22a
plot(t3)

set.seed(5)
t4 <- cv.tree(t3)
# figure 5.22b
plot(t4)

t5 <- prune.tree(t3, k = 10)
# figure 5.22c
plot(t5)
text(t5)

n.grid <- 250
p <- pred.square(t5, x1, x2, n.grid)
pred <- array(p$pred, c(n.grid, n.grid,3))
ind <- apply(pred, c(1, 2), order)[3, , ] 
# figure 5.22d
plot(dataset[ , 1:2], type = "n", xlab = expression(italic(x)[1]),  
     ylab = expression(italic(x)[2]))
for(k in 1:3)
  points(dataset[gr == k, 1:2], pch = 15 + k, col = 1 + k, cex = 1.5)
contour(p$x, p$y, ind, add = TRUE, drawlabels = FALSE, nlevels = 2, lty = 1, lwd = 2)
text(-4, 1, labels = 1, cex = 2)
text(3, 0, labels = 2, cex = 2)
text(2, -4, labels = 3, cex = 2)

You may have noticed that I used some functions which aren't included in any of the libraries we imported. These "helper functions" were developed by A. Azzalini as a supplement to his book, Data Analysis and Data Mining, which I have used as the guideline for this tutorial.

In [66]:
# help functions
pred.square <- function(model, x1, x2, grid = 50, Z.flag = FALSE)
{
  u <- pretty(x1, n = 10)
  x <- seq(u[1], u[10], length = grid)
  u <- pretty(x2, n = 10)
  y <- seq(u[1], u[10], length = grid)
  nx <- length(x)
  ny <- length(y)
  xoy <- cbind(rep(x, ny), as.vector(matrix(y, nx, ny, byrow = TRUE)))
  X <- matrix(xoy, nx * ny, 2, byrow = FALSE)
  if(Z.flag) pred <- predict(model, newdata = data.frame(z1 = X[ , 1], z2 = X[ , 2]))
  else  pred <- predict(model, newdata = data.frame(x1 = X[ , 1], x2 = X[ , 2]))
  list(pred = pred, x = x, y = y, X = X)
}

# function to produce the ROC curve
lift.roc <- function(previsti, g, type = "bin", plot.it = TRUE)
{
  library(sm)
  if(!is.numeric(g)) stop("g not numeric")
  ind <- rev(order(previsti))
  n <- length(g)
  x1 <- (1:n)/n
  x2 <- cumsum(g[ind]) / (mean(g) * (1:n))
  if(type == "crude" & plot.it) 
    plot(x1, x2, type = "l", col = 2,
         xlab = "fraction of predicted subjects", ylab = "lift")
  if(type == "sm") {
    a<- sm.regression(x1, x2, h = 0.1, display = "none")
    if(plot.it)
      plot(a$eval, a$estimate, type = "l", xlim=c(0,1), col = 2,
           xlab = "fraction of predicted subjects", ylab = "lift")
  }
  if(type == "bin") {
    b <-  binning(x1, x2, breaks = (-0.001:10) / 9.999)
    x <- c(0, seq(0.05, 0.95, by = 0.1), 1)
    if (plot.it) plot(x, c(x2[1], b$means, 1), type = "b", xlim = c(0, 1),
                      ylim = c(1, max(x2)), cex = 0.75, col = 2,
                      xlab = "fraction of predicted subjects",
                      ylab = "improvement factor")
    x1 <- x
    x2 <- c(x2[1], b$means,1)
  }
  if(plot.it) pause("press enter")
  u1 <- cumsum(1 - g[ind]) / sum(1 - g)
  u2 <- cumsum(g[ind]) / sum(g)
  if(type == "crude" & plot.it)
    plot(u1, u2, type = "l", xlim = c(0, 1), ylim = c(0, 1), col = 2,
         xlab = "1-specificity", ylab = "sensibility")
  if(type == "sm"){
    # browser()
    eps <- 0.00001
    a <- sm.regression(u1, log((u2 + eps) / (1 - u2 + 2 * eps)), h = 0.1, display = "none")
    q <- exp(a$estimate) / (1 + exp(a$estimate))
    if(plot.it) plot(a$eval, q, type = "l", xlim = c(0, 1), ylim = c(0, 1),
                     xlab = "1-specificity", ylab = "sensibility", col = 2)
  }
  if(type == "bin"){
    b <- binning(u1, u2, breaks = (-0.001:10) / 9.999)
    x <- c(0, seq(0.05, 0.95, by = 0.1),1)
    y<- c(0, b$means,1)
    if(plot.it)
      plot(x, y, type = "b", xlim = c(0,1),
           ylim = c(0, 1),cex = 0.75, xlab = "1-specificity",
           ylab = "sensibility", col = 2)
    u1 <- x
    u2 <- y
  }                      
  if(plot.it) {
    abline(0, 1, lty = 2, col = 3)
  }
  invisible(list(x1, x2, u1, u2))
}

# function to compute the confusion matrix
confusion.matrix <- function(predicted, observed){
  a <-  table(predicted, observed)
  err.tot <- 1 - sum(diag(a)) / sum(a)
  fn <- a[1, 2] / (a[1, 2] + a[2, 2])
  fp <- a[2, 1] / (a[1, 1] + a[2, 1])
  print(a)
  cat("total error: ", format(err.tot),"\n")
  cat("false positives/negatives: ",format(c(fp, fn)),"\n")
  invisible(a)
}