In these exercises, we'll use R to explore the correlation between a person's body mass index and their peripheral blood film measurements. We have a dataset, res_bodyfat.csv, with the attributes bmi and pbfm.
Let's take a look at basic linear regression in R.
Start by reading in the data:
bodyfat <- read.csv("res_bodyfat.csv", header=TRUE)
plot(bodyfat)
attach(bodyfat) # attach the data frame so its columns can be referenced by name
We can see that there's a third variable, msbp_missing, that I didn't mention earlier. It will be used in a later exercise, but for now we're just going to ignore it.
What we're really interested in here are those two graphs to the top left.
Notice how the data points cluster around a general trend. We can fit a linear regression line to the data, which gives us a straight line that follows that overall trend.
fit.linear <- lm(pbfm ~ bmi, data=bodyfat) # data isn't strictly needed here since attach(bodyfat) was called
beta <- coef(fit.linear)
plot(bmi, pbfm, main="Non transformed")
lines(bmi, beta[1] + beta[2]*bmi, col=2)
legend(40, 20, legend = c("linear"), col=c(2), lty=1)
This linear model can be good for a quick, rough estimate to find the trend the data follows, but fails to pick up the curvature of the data. Models created with higher order polynomials or fit on a logarithmically transformed axis could give better fits.
The better the fit, the more accurate the model will be at predicting future data, which is usually what we are trying to achieve. We can compare how well different models fit the data they represent by comparing their mean square errors.
The mean squared error, or $MSE$, is the average of the squared residuals.
In other words, for every data point, we compare the estimated y-value (the value from our model) with the actual y-value (the data point's y-value). These differences are our residuals. We then square all the residuals, so that negative and positive deviations are counted equally, and finally take the average of the squared residuals. The model that gives the lowest $MSE$ fits the data the best.
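Written as a formula, with $y_i$ the observed value, $\hat{y}_i$ the model's estimate, and $n$ the number of data points:
$$ MSE = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 $$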
Some information about our model fit.linear can be found by running summary(fit.linear). Our residuals are stored in fit.linear$residuals. Let's go ahead and calculate this model's $MSE$.
summary(fit.linear)
MSE.linear <- mean(fit.linear$residuals^2)
MSE.linear
Our linear model gives a mean square error of about 13.624.
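As a sanity check, the same number can be computed straight from the definition, comparing the observed pbfm values with the model's fitted values:
# MSE from the definition: average squared difference between observed and fitted values
mean((pbfm - predict(fit.linear))^2) # should match MSE.linear above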
Now, let's see what mean square error we get if we fit a linear model on the same data, but with a logarithmic transformation along our bmi-axis.
bmi.log = log(bodyfat$bmi) # bodyfat$ not needed here since attach(bodyfat) was called
plot(bmi.log, pbfm, main="Log. transformed")
fit.log <- lm(pbfm ~ bmi.log)
summary(fit.log)
beta.log <- coef(fit.log)
lines(bmi.log, beta.log[1]+beta.log[2]*bmi.log, col=2, lwd=2)
legend(3.6, 20, legend=c("log"), col=c(2), lty=1) # lty=line type and must be present for drawing in legend
MSE.log <- mean(fit.log$residuals^2)
MSE.log
This fit has a mean square error of about 9.705. Notice that the fitted line is still straight, but the mean square error is now smaller. If you're wondering what happened, remember that we transformed our bmi-axis logarithmically. This let us retain the speed and simplicity of a linear regression while still picking up some of the curvature of the data.
If you want to see what this model looks like on the original data, you can run something like:
plot(bmi, pbfm, main='Log on original data')
points(bmi, beta.log[1] + beta.log[2]*bmi.log, col=2, pch=18)
Let's see if we can get an even better fit by adding a quadratic bmi term to the model.
fit.quad <- lm(pbfm ~ bmi + I(bmi^2))
summary(fit.quad)
beta.quad <- coef(fit.quad)
x <- seq(min(bmi), max(bmi), length=200) # Needed so that x's are in asc order
plot(bmi, pbfm, main="Linear and Quad effects")
lines(bmi, beta[1] + beta[2]*bmi, col=2, lwd=2)
lines(x, beta.quad[1] + beta.quad[2]*x + beta.quad[3]*x^2, col=4, lwd=2)
legend(40, 20, legend=c("linear", "quad"), col=c(2, 4), lty=1) # lty=line type and must be present for drawing in legend
MSE.quad <- mean(fit.quad$residuals^2)
Let's compare the three mean square errors in a table. A will represent the linear $MSE$, B the log, and C the quadratic.
as.table(c(MSE.linear, MSE.log, MSE.quad))
Of these three, the model with the quadratic bmi term gave the lowest mean square error, and in turn the best fit for these data. If we, however, increase the degree of the polynomial $p$ even further, we should expect an even lower $MSE$ on the data the model was fitted to. As $p$ grows, the model's training mean square error keeps shrinking, $MSE \rightarrow 0$.
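As a quick illustration (a sketch only, assuming bodyfat is still loaded and attached), we can compute the training $MSE$ for polynomial fits of increasing degree and watch it fall:
# Training MSE for polynomial fits of degree 1 to 10 (illustration only)
MSE.by.degree <- sapply(1:10, function(p) mean(lm(pbfm ~ poly(bmi, p), data=bodyfat)$residuals^2))
MSE.by.degree # can only decrease (or stay level) as the degree grows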
Before we check models with even higher polynomial degrees, let's talk a little about what fitting a model is actually doing.
The ultimate goal of model fitting is to find a function that learns the correlations in a dataset in order to predict future data. In other words, the goal is to see through any "noise" the data may contain and instead capture the general trend of the data.
With every increase in the degree of the polynomial $p$, the model gets a step closer to completely interpolating the data used to train it. Eventually, the model will start giving very low error on its training data set but very poor accuracy on new data. This lower error on training data but higher error on test data is the sign of an over-fit model.
We have now seen two types of models: under-fit and over-fit. The under-fit models were those with high mean square errors due to a lack of complexity. Over-fit models are those with high errors on new data due to an excess of complexity. This means there must exist some minimum test error, which gives the right balance between too little and too much complexity.
The best way to find that perfect balance between under-fit and over-fit models is by splitting the data you have access to into some test data and training data.
For example, you can try splitting the data into 4 parts, where you use three as your training set and the remaining part as your test set.
To get even more out of your data, you can then build a model for each of the four possible ways of grouping these 4 parts into 3 training parts and 1 test part.
$$ ABC_{train} D_{test} \quad ABD_{train} C_{test} \quad ACD_{train} B_{test} \quad BCD_{train} A_{test} $$
You can then average the test errors of the four models to get a more reliable estimate of how well a given model generalises. This process of breaking the data into equally sized parts, fitting a model on each training set, evaluating it on the matching test set, and averaging the resulting mean square errors, is called cross validation.
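To make this concrete, here is a minimal manual sketch of one such 4-fold split for the plain linear model (the seed and fold assignment are arbitrary and purely illustrative; below we'll let R handle all of this for us):
# Manual 4-fold cross validation of the linear model (illustration only)
set.seed(1) # arbitrary seed, only so the split is reproducible
folds <- sample(rep(1:4, length.out=nrow(bodyfat))) # assign every row to one of 4 parts
cv.errors <- numeric(4)
for(k in 1:4){
  train <- bodyfat[folds != k, ] # three parts for training
  test <- bodyfat[folds == k, ] # one part for testing
  fit <- lm(pbfm ~ bmi, data=train)
  cv.errors[k] <- mean((test$pbfm - predict(fit, newdata=test))^2)
}
mean(cv.errors) # cross-validated estimate of the test MSE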
In R, we use the cv.glm() function from the boot library to implement this method. This function randomly splits the data into $K$ groups, places $K-1$ of them in a training set, trains a model on this set, tests the model on the remaining group, and then regroups, retrains, and retests until every partition has been used as a test set.
Since cv.glm() makes use of R's built-in random number generator, we can set a specific seed so that the "randomly chosen" splitting and sorting of the data can be repeated (i.e. it gives the same split each time that seed is set). This also means that we can set new seeds and run the same cross validations on completely different partitions of the data.
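For example (the seed value itself is arbitrary):
set.seed(42)
sample(1:10, 3) # some "random" triple
set.seed(42)
sample(1:10, 3) # the exact same triple again, because the seed was reset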
With this in mind, I decided to calculate cross validations with multiple different random seeds, in order to get an even more general model.
In the code below, I compare 27 different random seeds. For each seed, polynomials of degree $p$ from 1 to 10 are fitted. For every degree, a 4-fold cross validation is run to find the average squared prediction error over the $K=4$ groupings. The $MSE$ for each $p$ is then plotted to show, for that specific seed, the relation between mean square error and the degree of polynomial used to fit the model.
The code has two nested for loops with a 4-fold cross validation inside the inner loop, so it performs $27\cdot 10 \cdot 4=1080$ model fits before it has fully executed.
# Cross validation
library(boot)
# Set up plots to show all test errors
par(mfrow=c(3,3))
# This will be where the list of MSEs for every random split is stored
MSE = NULL
# Loop over 27 different seeds; with the 3x3 layout, 9 MSE curves are shown per plot page
for(s in 1:27){
# These are the ind. MSE per poly degree for this random split
MSE.random = NULL
# reset the seed for this iteration
set.seed(7*s)
for(i in 1:10){
# fit a polynomial model of degree i
model <- glm(pbfm ~ poly(bmi, degree=i), data=bodyfat)
# find cross-validation for this model
MSE.4.fold <- cv.glm(bodyfat, model, K=4)$delta[1]
# append MSE found to this split's list
MSE.random[i] <- MSE.4.fold
}
# Plot these MSEs
plot(c(1:10), MSE.random, type="b")
# Store this iteration's MSEs in the main MSE matrix (to be averaged later)
MSE <- cbind(MSE, MSE.random)
}
# Take average of all MSE values for each poly degree
MSE.avg <- (rowMeans(MSE))
#reconfigure plot
par(mfrow=c(1,1))
# show results
plot(1:10, MSE.avg, type="b", main="Averages over MSEs from Cross Validation", ylim=c(8, MSE.avg[1]), col=2)
The degree of polynomial with the lowest $MSE$ can be found by running:
which.min(MSE.avg)
min(MSE.avg)
This tells us that fitting a model with the bmi variable transformed to a degree-4 polynomial will give the best fit for future data.
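If you want to fit and inspect this final model, something like the following works (the object name fit.poly4 is just a suggestion):
fit.poly4 <- lm(pbfm ~ poly(bmi, 4), data=bodyfat)
summary(fit.poly4)
mean(fit.poly4$residuals^2) # training MSE of the chosen model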
Another technique for model selection is to use an information criterion (IC), which applies a penalty for increasing the number of parameters. Here, both the Akaike information criterion (AIC), which uses the penalty $2p$, and the Bayesian information criterion (BIC), which uses the penalty $p\log(n)$, are employed, and the results are plotted below.
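Explicitly, with $\hat{L}$ the maximised likelihood, $p$ the number of estimated parameters, and $n$ the number of observations:
$$ AIC = -2\log\hat{L} + 2p \qquad BIC = -2\log\hat{L} + p\log(n) $$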
We see that model selection by AIC yields the same result as the cross-validation above, with AIC being at its lowest for the degree-4 polynomial.
The BIC, on the other hand, penalises model complexity more heavily than both AIC and the cross-validation, reflected in BIC being lowest for the degree-3 polynomial.
According to Hastie et al. (2009), for very large sample sizes AIC tends to choose too complex models, while for smaller sample sizes, BIC tends to choose too simple models due to its penalty on complexity. Therefore, with N = 327 observations in our dataset, I would choose the model where IC is lowest based on AIC, that is, the model with degree 4 of the polynomial.
AIC_BIC <- data.frame(p = 1:10,
AIC = numeric(10),
BIC = numeric(10))
for (p in AIC_BIC$p){
formula <- bquote(pbfm ~ poly(bmi, .(p)))
AIC_BIC$AIC[p] <- AIC(lm(as.formula(formula), data = bodyfat))
AIC_BIC$BIC[p] <- BIC(lm(as.formula(formula), data = bodyfat))
}
min_AIC <- which(AIC_BIC$AIC == min(AIC_BIC$AIC))
min_AIC
min_BIC <- which(AIC_BIC$BIC == min(AIC_BIC$BIC))
min_BIC
library(tidyverse)
fig_AIC_BIC <- AIC_BIC %>%
gather(`Information criterion`, Value, AIC, BIC) %>%
ggplot() +
aes(x = p, y = Value, col = `Information criterion`) +
geom_line(size = 1) +
geom_point(aes(x = min_AIC, y = AIC_BIC$AIC[min_AIC]), col = "blue", size = 3) +
geom_point(aes(x = min_BIC, y = AIC_BIC$BIC[min_BIC]), col = "red", size = 3) +
scale_x_continuous(breaks = c(1:10)) +
scale_colour_manual(values = c("blue", "red")) +
theme_bw() +
labs(x = "Model complexity (order of polynomial)",
y = "AIC and BIC criterion") +
theme(legend.position = "bottom")
fig_AIC_BIC