In this notebook, we will explore some basic criteria for variable selection within models, along with the difference between qualitative and quantitative variables. It can sometimes be helpful to reduce the number, or dimensions, of variables being used to create models when it comes to visualizing and interpreting results of the model. In most cases, with dimensions higher than three, it becomes complicated to present correlations of a model in an understandable fashion. We therefore would like to find some way to evaluate and compare different combinations of variables, to see which gives the lowest mean square error of new data.
As you've probably noticed, mean square error of new data is usually the go-to criteria when comparing model's since if gives a precise measurement of the efficiency of our model.
In Linear, Logarithmic, and Polynomial Regressions we looked at a data set where both variables consisted of quantitative data. This means that the data had some number value. But not all data needs to have some innate, number value. Take for example flipping a coin, or comparing a placebo to a medicine, or a person's gender. All of these are examples of some non-numeric characteristics data can have. Categorical data like the ones mentioned here is called qualitative data.
Let's read in a data set containing some qualitative data.
oral_ca <- read.csv("oral_ca.csv", header = TRUE)
str(oral_ca)
summary(oral_ca)
As we can somewhat see in the output above, there are some variables with discrete values of either 0 or 1 (ccstatus & sex).
We begin by looking at the correlation between ccstatus and cigs as a dichotomized variable. The table below shows the observed frequencies of cases and controls for observations in the non-smokers and smokers groups. We observe that the number of smokers in the dataset is about three times larger than the number of non-smokers, and also that the majority of smokers are in the case group, while the majority of non-smokers are in the control group.
library(tidyverse)
options(warn=1)
oral_ca <- oral_ca %>%
mutate(cigs.d = ifelse(cigs > 0, "Smokers", "Non-smokers"),
ccstatus.d = ifelse(ccstatus == 1, "Case", "Control"))
table(oral_ca$cigs.d, oral_ca$ccstatus.d)
We can also compute the estimated probabilities, with their standard errors, of being a case for each of the groups (“Smokers” and “Non-smokers”). The results are shown below. We also include the probability for being a case for the observations in the sample as a whole. The estimated probability of a smoker being a case is approximately 0.562, with a standard error of 0.028. For a non-smoker the corresponding numbers are 0.242 and 0.045. (For the whole sample the estimated probability is 0.489 with a standard error of 0.025.)
probs <- oral_ca %>%
summarise(pi.smoke.hat =
sum(ccstatus[cigs.d == "Smokers"]) / sum(cigs.d == "Smokers"),
pi.nonsmoke.hat =
sum(ccstatus[cigs.d == "Non-smokers"]) / sum(cigs.d == "Non-smokers"),
pi.common.hat =
sum(ccstatus) / n(),
se.pi.smoke.hat =
sqrt(pi.smoke.hat * (1 - pi.smoke.hat) / sum(cigs.d == "Smokers")),
se.pi.nonsmoke.hat =
sqrt(pi.nonsmoke.hat * (1 - pi.nonsmoke.hat) / sum(cigs.d == "Non-smokers")),
se.pi.common.hat =
sqrt(pi.common.hat * (1 - pi.common.hat) / n()))
t(probs)
Using the estimated probabilities and their standard errors from above, we compute the likelihood ratio test statistics, w. This test statistics is 2(log(L1)–log(L0)) where L0 is the maximized loglikelihood function under the null hypothesis that the probabilities are equal (i.e. does not depend upon being a smoker or not) and L1 the maximized log-likelihood function without that constraint (Azzalini & Scarpa, 2012). We also compute the p-value of w based on the chi-squared distribution with 1 df. As the p-value is close to zero, we reject the null hypothesis of equal probability.
res <- oral_ca %>%
summarise(llik_pi.smoke.hat_pi.nonsmoke.hat =
sum(dbinom(ccstatus[cigs.d == "Smokers"], 1, probs$pi.smoke.hat, log = TRUE)) +
sum(dbinom(ccstatus[cigs.d == "Non-smokers"], 1, probs$pi.nonsmoke.hat, log = TRUE)),
llik_pi.common.hat = sum(dbinom(ccstatus, 1, probs$pi.common.hat, log = TRUE)),
w = 2 * (llik_pi.smoke.hat_pi.nonsmoke.hat - llik_pi.common.hat),
p.value = 1 - pchisq(w, df = 1))
t(res)
Here we will fit a linear logistic model with cigarettes as a dichotomized variable. To fit a linear logistics model, we use the glm
function, and the results are shown below. As the coefficient of cigs is positive and significant at less than 0.1 percent significance level, being a smoker seems to increase the risk of oral cancer. Specifically, the increase in log-odds from smoking is approximately 1.39. This corresponds to an increase in odds by the exponential of the log-odds, which is approximately 4.028. We can also see this by the calculation below. Furthermore, the intercept coefficient is the log-odds of cancer for the reference group of not being a smoker.
mod.c <- glm(ccstatus ~ cigs.d,
family = "binomial",
data = oral_ca)
summary(mod.c)
beta <- mod.c$coefficients
pi_values <- data.frame(pi_smoke = exp(beta[1] + beta[2]) / (1 + exp(beta[1] + beta[2])),
pi_nonsmoke = exp(beta[1]) / (1 + exp(beta[1])))
log_odds <- pi_values %>%
mutate(odds_smoke = pi_smoke / (1 - pi_smoke),
odds_nonsmoke = pi_nonsmoke / (1 - pi_nonsmoke),
log_odds_smoke = log(odds_smoke),
log_odds_nonsmoke = log(odds_nonsmoke),
diff_log_odds = log_odds_smoke - log_odds_nonsmoke,
diff_odds = exp(diff_log_odds))
t(log_odds)
Instead of using a dichotomized variable for cigs, we now use the continuous variable in the model. The results are shown below. The coefficient of cigs of approximately 0.054 is now the expected change (increase) in log-odds from one additional cigarette smoked per day. Although they are both related to the odds for non-smokers, the coefficient of the intercept changes with respect to pt. (c) because cigs is now a continuous variable.
mod.d <- glm(ccstatus ~ cigs,
family = "binomial",
data = oral_ca)
summary(mod.d)
We now include the other three variables (drinks, age and sex) in the model as well, and the results are shown below. Of the three variables, drinks and sex are significant at 5 percent significance level or less, and both have a positive correlation with the risk of oral cancer. The coefficient of the variable age is also positive, but it is not significantly different from zero. Furthermore, we observe that the increase in the log-odds for one more cigarette smoked per day now is 0.035, which is a lower value than in pt. (d) where it was 0.054. The reason for this is that in pt. (d), the model suffered from omitted variable bias. That is, one or more relevant variables correlating both with ccstatus and with cigs were left out of the model, resulting in the coefficient of cigs being estimated to be too high. Finally, we can notice that AIC is 453.8.
mod.e <- glm(ccstatus ~ cigs + drinks + age + sex,
family = "binomial",
data = oral_ca)
summary(mod.e)
The coefficient of age has a positive value, but it is not statistically significant at 5 percent level, thus we do not reject the null hypothesis that the coefficient is zero. Hence, it does not seem to be the case that older people risk more, controlling for the other variables (cigs, drinks and sex). We now fit a new model excluding age, and the results are shown below. Specifically, the coefficients of cigs, drinks and sex are more or less the same as in pt. (e), and AIC is slightly improved to 452.3. As the model without age is simpler than the model including age, and also improves AIC, I would prefer this simpler model.
mod.f <- glm(ccstatus ~ cigs + drinks + sex,
family = "binomial",
data = oral_ca)
summary(mod.f)
We still exclude age and instead include a polynomial of degree 2 to model the effect of drinks, where the results are shown below. The coefficients of all variables are statistically significant at a 5 percent significance level, and AIC is further improved to 447.6, compared to 452.3. Hence, including a polynomial of degree 2 appears to improve the model
mod.g <- glm(ccstatus ~ cigs + drinks + I(drinks^2) + sex,
family = "binomial",
data = oral_ca)
summary(mod.g)
If we instead include a quadratic term for cigs, we get the results shown below. It appears that including a quadratic term for cigs does not improve the model as the coefficient of cigs$^2$ is not significant and AIC is increased to 453.6.
mod.h <- glm(ccstatus ~ cigs + I(cigs^2) + drinks + sex,
family = "binomial",
data = oral_ca)
summary(mod.h)
Finally, we can build models including a cubic term for drinks and cigs, respectively, to see if that improves the fit. However, as the results below show, neither models appear to be an improvement compared to the simpler ones, as the coefficients of the second and third order polynomials are not statistically significant in any of the models. Also, AIC is worse in both models compared to the AIC of the model in pt. (g). Hence, the best model seems to be the one with quadratic effect for drinks.
mod.i1 <- glm(ccstatus ~ cigs + drinks + I(drinks^2) + I(drinks^3) + sex,
family = "binomial",
data = oral_ca)
summary(mod.i1)
mod.i2 <- glm(ccstatus ~ cigs + I(cigs^2) + I(cigs^3) + drinks + sex,
family = "binomial",
data = oral_ca)
summary(mod.i2)
Finally, we examine which model is the best for prediction by splitting the data into a training set and a test set, where we use 3/4 of the data for training and 1/4 for testing. To ensure that results can be repeated, we use set.seed. However, it should be noticed that the computed errors of the training and test data for the different models will depend on the split into training and test. If the data were split using a different set.seed value, the results may change.
set.seed(2468)
######
# Break data into train/test sets and fit pervious models
## 75% of the sample size
train.size <- floor(0.75 * nrow(oral_ca))
## set the seed to make your partition reproducible
set.seed(13) # Will need to delete/update if I'm trying to run this simulation multiple times
train.indexes <- sample(seq_len(nrow(oral_ca)), size = train.size)
# Training/testing sets
oc.train <- oral_ca[train.indexes, ] # set of indexes from sample()
oc.test <- oral_ca[-train.indexes, ] # rest of indexes
MSE.oral.train = NULL
MSE.oral.test = NULL
# fit to models described in previous parts
model.cig.train <- glm(ccstatus ~ cigs, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.cig.train)
MSE.oral.train[1] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.cig.train, newdata = oc.test, type = "response")
MSE.oral.test[1] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
model.all.train <- glm(ccstatus ~ cigs + drinks + sex + age, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.all.train)
MSE.oral.train[2] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.all.train, newdata = oc.test, type = "response")
MSE.oral.test[2] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
model.noage.train <- glm(ccstatus ~ cigs + drinks + sex, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.noage.train)
MSE.oral.train[3] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.noage.train, newdata = oc.test, type = "response")
MSE.oral.test[3] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
model.cig2.train <- glm(ccstatus ~ cigs + I(cigs^2) + drinks + sex, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.cig2.train)
MSE.oral.train[4] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.cig2.train, newdata = oc.test, type = "response")
MSE.oral.test[4] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
model.drink2.train <- glm(ccstatus ~ cigs + drinks + I(drinks^2) + sex, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.drink2.train)
MSE.oral.train[5] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.drink2.train, newdata = oc.test, type = "response")
MSE.oral.test[5] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
model.drink3.train <- glm(ccstatus ~ cigs + drinks + I(drinks^3) + sex, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.drink3.train)
MSE.oral.train[6] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.drink3.train, newdata = oc.test, type = "response")
MSE.oral.test[6] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
model.drink23.train <- glm(ccstatus ~ cigs + drinks + I(drinks^2) + I(drinks^3) + sex, data=oc.train, family = "binomial"(link="logit"))
pred <- predict(model.drink23.train)
MSE.oral.train[7] = sum(((oc.train$ccstatus - pred)^2 ) / length(oc.train$ccstatus))
pred <- predict(model.drink3.train, newdata = oc.test, type = "response")
MSE.oral.test[7] = sum(((oc.test$ccstatus - pred)^2 ) / length(oc.test$ccstatus))
# plot MSEs for different models
plot(1:7, MSE.oral.train, type='b', col=2, xlab="Model", main="MSE of train sets on models")
legend("bottomright", legend=c("1 = cigs", "2 = all", "3 = no age", "4 = cigs^2", "5 = drinks^2", "6 = drinks^3", "7=drinks^2+drinks^2"))
plot(1:7, MSE.oral.test, type='b', col=2, xlab="Model", main="MSE of test sets on models")
legend("topright", legend=c("1 = cigs", "2 = all", "3 = no age", "4 = cigs^2", "5 = drinks^2", "6 = drinks^3", "7=drinks^2+drinks^2"))