Model selection

Author

Panagiotis Papasaikas

Published

December 11, 2024

Recap on linear regression

What is linear regression

  • Generally, a model formalizes the relationship between variables: \(y = f(x_i)\)
    Notation: \(y\) and \(x_i\) (\(x_1\), \(x_2\), …) are random variables
  • By defining \(f(\ldots)\) we formalize an assumption about the type of relationship
  • Linear models model or predict a variable (\(y\)) as a linear combination of one or several explanatory variables (\(x_i\)):
    \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n\)
  • \(y\) is the response variable or dependent variable (it depends on the \(x_i\))
  • \(x_i\) are the independent variables or predictors (they allow prediction of \(y\))
  • \(\beta_i\) are the parameters to fit (coefficients).
    In the simple case of \(y = \beta_0 + \beta_1x_1\), \(\beta_0\) is the intercept and \(\beta_1\) the slope.
  • The relationship between \(y\) and the \(x_i\) can be formally written as:

\(y = \beta_0 + \sum_{i}{\beta_i x_i} + \epsilon\)

or in matrix notation:

\(y = \beta_0 + \beta \cdot X + \epsilon\)

  • \(\epsilon\) contains the errors (e.g. unknown influences, measurement errors) or residuals, often assumed to follow a normal distribution with mean zero: \(\epsilon \sim \mathcal{N}(0,\sigma^2)\)
  • \(\epsilon\) simply is the difference between the observed values \(y\) and the corresponding predicted values \(\hat{y} = \beta_0 + \sum_{i}{\beta_i x_i}\)

\(\epsilon = y - \hat{y}\)

  • “Fitting a linear model” means finding values for \(\beta_i\) that minimize the residuals, for example the sum of their squares:

\(\min_{\beta_i} \sum_{j}\epsilon_j^2 = \sum_{j}(y_j - \hat{y_j})^2\)

  • Linear regression can be used for:

    1. Predictive modeling, that is, training a model on observed data and using it to predict unobserved \(y\) from new values of \(x_i\).

    2. Studying the association of \(x_i\) with \(y\) (statistical modeling): given \(y\) and several \(x_1, x_2, \ldots, x_n\) that may be related to \(y\), (linear) models can be applied to quantify the strength of the association between \(x_i\) and \(y\). An \(x_i\) that is strongly associated with \(y\) is a strong predictor of \(y\), and the predictive performance of a model that includes \(x_i\) will be significantly better compared to the model without \(x_i\) as a predictor.

Creating and interpreting linear models in R

  • In R, linear models are created using lm()

  • For a model \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\), the formula is of the form: y ~ x1 + x2

    • multiple regressors are combined using +
    • the intercept (constant term) \(\beta_0\) is implicitly included, unless you say y ~ x1 + x2 + 0
  • Interactions between predictors can be added to the model using either : (a specific interaction term) or * (which expands to the main effects plus all possible interaction terms).

    • e.g. fit <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, data = DF) or simply fit <- lm(y ~ x1 * x2 * x3, data = DF)
  • lm() returns an R object of class lm, for which there are a number of methods available (see the short sketch after this list):

    • coefficients(fit) returns the \(\beta_i\) (e.g. the slope and intercept terms in the case of a single predictor).
    • residuals(fit) returns the \(\epsilon\) values. You could, for example, calculate the residual sum of squares using sum(residuals(fit)^2).
    • predict(fit) returns the fitted values for the response (the \(\hat{y}\)). You can also use this function to predict the response for new observations \(x_i\), using predict(fit, newdata = ...), where newdata is a data.frame of new predictor values.
    • summary(fit) gives a useful summary of the overall results of the regression.
    • The lm object has its own plot method, which produces diagnostic plots to assess the quality of a linear model.
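
A minimal sketch with simulated data (the variable names and values are purely illustrative) showing these methods in action:

set.seed(42)
df <- data.frame(x1 = rnorm(20), x2 = rnorm(20))
df$y <- 1 + 2 * df$x1 - 0.5 * df$x2 + rnorm(20, sd = 0.3)

fit <- lm(y ~ x1 + x2, data = df)                       # fit the linear model
coefficients(fit)                                       # estimated beta_0, beta_1, beta_2
sum(residuals(fit)^2)                                   # residual sum of squares
predict(fit, newdata = data.frame(x1 = 0.5, x2 = -1))   # prediction for a new observation
summary(fit)                                            # overall summary of the regression
plot(fit)                                               # diagnostic plots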

Principles of model selection

Model selection refers to the task of selecting a good / the best model among a set of candidate models. Model selection is a critical part of an analysis since a flawed model will lead to incorrect predictions and conclusions.

But what exactly do we mean by “goodness” of a model? Informally, we can say that our model needs to be flexible enough to capture the underlying relationship between predictors and response, but at the same time it should not introduce unnecessary complications.

This is, in essence, the principle also known as Occam’s razor, which states that among candidate models of similar explanatory power, preference should be given to the simpler one.

Let’s discuss this guiding principle while looking at a motivating example. We want to model the relationship between x and y given the data below:

set.seed(1)
npoints <- 25
x <- runif(npoints, 0, 11)
y <- (2-x) * (x-9) + rnorm(npoints,0,2.5)
plot(x,y,pch=19,col="skyblue4")

Let’s first try to fit a model to this data using a univariate linear model:

fit1 <- lm(y ~ x)
seqx <- seq(0, 11, length.out = 128)
plot(x,y,pch=19,col="skyblue4")
lines(seqx, predict(fit1, data.frame(x = seqx)), col="darkred", lwd=2)

Obviously this model is a poor fit to the data. Assuming a linear relationship is too simplistic and therefore our model lacks the ability to capture the underlying relationship between response and predictors.

Model bias

The previous model is an example of underfitting.

The error arising as a result of overly simplistic or more generally of incorrect modelling assumptions is called bias.

Now let’s add to our model predictors that are higher polynomial degrees of x. Each added polynomial term is an extra explanatory variable that bestows our model with increased modelling capacity.

We can specify to lm a second degree polynomial ( \(y = \alpha + \beta_1 x + \beta_2 x^2\) ) as lm(y ~ x + I(x^2)) or lm(y ~ poly(x, 2, raw=TRUE))

and in general a polynomial of degree n ( \(y = \alpha + \beta_1 x + \beta_2 x^2 + \ldots + \beta_n x^n\) ) as lm(y ~ x + I(x^2) + ... + I(x^n)) or lm(y ~ poly(x, n, raw=TRUE))

Let’s look at the linear model that tries to model the response using the first two polynomial degrees of x:

# A model of the form y= a0 + a1x + a2x^2 :
fit2 <- lm(y ~ poly(x,2, raw=TRUE)) 
plot(x,y,pch=19,col="skyblue4") 
lines(seqx, predict(fit2, data.frame(x = seqx)), col="darkorange", lwd=2)

This intuitively looks like a reasonable model choice for our observations. It seems to capture the regularities of the process that gave rise to the data, while at the same time being a relatively simple model that requires only three coefficients to be fully specified.

# A model of the form y= a0 + a1x + a2x^2 + ... +a15x^15  :
fit3 <- lm(y ~ poly(x,15,raw=TRUE)) 
plot(x,y,pch=19,col="skyblue4")
lines(seqx, predict(fit3, data.frame(x = seqx)), col="green", lwd=2)

This does not look like a good choice. Our model gives us a curve that matches the observations very closely (interpolates them), but it does so at the cost of being overly sensitive to small fluctuations in the data.

Model variance

The error arising from high sensitivity to the inputs is referred to as variance.

High-variance models capture inconsequential details of the observations that are probably the result of noise and are therefore unique to the training data and unlikely to be present in a new set of observations generated by the same process. As a result they miss the “big picture”, lowering their performance on new data.

We say that such models overfit the training data and do not generalize well on new observations.

Let’s see in practice how our three models perform on a new set of observations:

set.seed(8)
newx <- runif(npoints, 0, 11)
newy <- (2-newx) * (newx-9) + rnorm(npoints,0,2.5)

plot(newx,newy,pch=19,col="skyblue4")
lines(seqx, predict(fit1, data.frame(x = seqx)), col="red", lwd=2)
lines(seqx, predict(fit2, data.frame(x = seqx)), col="darkorange", lwd=2)
lines(seqx, predict(fit3, data.frame(x = seqx)), col="green", lwd=2)

We can now formalize our intuition about what constitutes “bad” and “good” models:

Model selection principles summarized
  • The sources of error that depend on model selection can be decomposed in bias and variance errors.
  • Our model selection should control for both types of errors in order to strike a good balance between modelling capacity and potential to overfit.
  • A low-bias, low-variance model fits the training data well but also performs well on unseen data, that is, it is able to generalize.

Comparing linear models and assessing the value of predictors

Residuals, r.squared and adj.r.squared

In most classes of models, including linear regression, the addition of model parameters, either in the form of more predictors or more interaction terms, reduces bias and tends to improve the fit to the training data. This is reflected as a reduction of the residuals and an inflation of the r-squared value. This reduction in bias also means that the modelling capacity grows (i.e. its variance increases), making our model more prone to overfitting and loss of generalization.

For example for the three linear models discussed above:

summary(fit1)$r.squared
[1] 0.1039079
summary(fit2)$r.squared
[1] 0.9648976
summary(fit3)$r.squared
[1] 0.9859424
sum(residuals(fit1)^2)
[1] 1702.277
sum(residuals(fit2)^2)
[1] 66.68294
sum(residuals(fit3)^2)
[1] 26.70486

Features that result in imperceptible improvements in terms of those metrics are unlikely to boost model predictive performance. However, looking at the residuals and r.squared values is not in itself enough to assess the quality of our model: they evaluate only one aspect of it (the fit to the training data) and do not tell us much about its ability to generalize; these metrics do not account in any way for the increase in model complexity and variance resulting from additional explanatory variables.

One measure that attempts to account for the spurious increase in r.squared when extra explanatory variables are introduced to the model is the adj.r.squared.

adj.r.squared is a corrected r.squared value that adjusts for the number of explanatory variables (\(p\)) in a model relative to the number of data points (\(n\)):

\(R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-p-1}\)
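
As a quick sanity check of this formula, we can compute it by hand for fit2 from the polynomial example above (here n = 25 observations and p = 2 explanatory variables, x and x^2) and compare with what R reports:

n <- length(y)                          # number of observations
p <- 2                                  # number of explanatory variables (x and x^2)
R2 <- summary(fit2)$r.squared
1 - (1 - R2) * (n - 1) / (n - p - 1)    # should match summary(fit2)$adj.r.squared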

Uses of adjusted R squared

adj.r.squared is therefore a more informative metric for the tasks of:

  1. Feature selection: Features that fail to substantially increase, or that even decrease, the value of adj.r.squared do not contribute to the model's predictive power, and we should consider removing them.

  2. Comparing models of the same response with different numbers of predictors 1.

Let’s take a look at how the residual sum of squares (RSS), r.squared and adj.r.squared change as a function of the number of model parameters in our example:

Rsq <- rep(0, 15)
Rsq.adj <- rep(0, 15)
Res <- rep(0, 15)

for (np in 1:15) {
  fit <- lm(y ~ poly(x, np, raw = TRUE))
  Rsq[np] <- summary(fit)$r.squared
  Rsq.adj[np] <- summary(fit)$adj.r.squared
  Res[np] <- sum(residuals(fit)^2)
}
par(mfrow=c(1,2))
plot (1:15, Rsq, col="skyblue4",  xlab="# parameters" , ylim=c(0,1), type="l", lwd=4)
lines (1:15, Rsq.adj, col="darkred", lwd=4)
legend(10,0.5,c("unadjust.","adjusted"),pch=19,col=c("skyblue4","darkred") )
plot  (1:15, Res, col="skyblue4", lwd=4, ylab="RSS", xlab="# parameters", type="l")

We see that the addition of polynomial degrees beyond 2 only marginally increases the r.squared values, while it does not improve, or even decreases, the adj.r.squared values, suggesting that these terms do not contribute to predictive power and probably hurt generalization performance.

Assessing models and relevance of predictors using analysis of variance (ANOVA)

Is there a more principled way to assess whether specific predictors are beneficial to our model, or more generally, to compare models?

One can use the framework of Hypothesis testing we introduced in a previous lecture in order to test whether the difference in the residual sum of squares (RSS) resulting from additional features is significant enough to justify their inclusion in the model.

More generally, we can evaluate the null hypothesis (H0) that the difference in RSS between two models is not significant, while accounting for the (possibly) different residual degrees of freedom of the two models.

Low P-values lead to rejection of H0 and favor the model with the lower RSS.

This type of analysis is an example of a group of statistical hypothesis testing procedures collectively known as ANalysis Of VAriance (ANOVA). ANOVA tests evaluate the impact of one or more factors (e.g. a categorical variable) by decomposing the total variance of a variable into variation originating from the factor(s) of interest and the rest. (Actually, the t-test we introduced in earlier lectures is another special case of an ANOVA test.)

In the case of applying ANOVA to model comparison, the factor of interest is the choice of model and the variable whose variation we evaluate is the RSS.

anova(fit1 , fit2)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1     23 1702.28                                  
2     22   66.68  1    1635.6 539.61 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit2 , fit3)
Analysis of Variance Table

Model 1: y ~ poly(x, 2, raw = TRUE)
Model 2: y ~ poly(x, 15, raw = TRUE)
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1     22 66.683                          
2     11 26.705 11    39.978 1.497 0.2572
anova(fit1, fit2 , fit3)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
Model 3: y ~ poly(x, 15, raw = TRUE)
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1     23 1702.28                                   
2     22   66.68  1   1635.59 673.718 3.211e-11 ***
3     11   26.70 11     39.98   1.497    0.2572    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 1. (Rsquared, adj.Rsquared, ANOVA)

Consider a response variable y and predictor variables x1, x2, x3 that are generated as follows:

yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- rnorm(100)
xx3 <- rnorm(100)

A. Evaluate the models that use only x1, x1+x2, or x1+x2+x3 to predict y, using the r.squared and adj.r.squared metrics as well as ANOVA testing.

B. Repeat, but this time using variables that are generated as follows:

yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <-  2*xx1 + rnorm(100)
xx3 <-  -xx1 + rnorm(100)
# simulate data (only x1 is truly correlated to y)
set.seed(1)
yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- rnorm(100)
xx3 <- rnorm(100)
summary(lm(yy ~ xx1))$r.squared
[1] 0.4673515
summary(lm(yy ~ xx1 + xx2))$r.squared
[1] 0.4685423
summary(lm(yy ~ xx1 + xx2 + xx3))$r.squared
[1] 0.4685444
# adjusted r-squared instead of r-squared
summary(lm(yy ~ xx1))$adj.r.squared
[1] 0.4619164
summary(lm(yy ~ xx1 + xx2))$adj.r.squared
[1] 0.4575844
summary(lm(yy ~ xx1 + xx2 + xx3))$adj.r.squared
[1] 0.4519364
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2))
Analysis of Variance Table

Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 42.542                           
2     97 42.447  1  0.095103 0.2173 0.6421
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2+ xx3))
Analysis of Variance Table

Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2 + xx3
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 42.542                           
2     96 42.447  2  0.095275 0.1077  0.898
# simulate data (x1, x2, x3 are collinear)
set.seed(2)
yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- 2*xx1+rnorm(100)
xx3 <- -xx1+rnorm(100)
summary(lm(yy ~ xx1))$r.squared
[1] 0.5580554
summary(lm(yy ~ xx1 + xx2))$r.squared
[1] 0.5580577
summary(lm(yy ~ xx1 + xx2 + xx3))$r.squared
[1] 0.5581602
# adjusted r-squared instead of r-squared
summary(lm(yy ~ xx1))$adj.r.squared
[1] 0.5535458
summary(lm(yy ~ xx1 + xx2))$adj.r.squared
[1] 0.5489455
summary(lm(yy ~ xx1 + xx2 + xx3))$adj.r.squared
[1] 0.5443527
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2))
Analysis of Variance Table

Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1     98 58.893                          
2     97 58.892  1 0.0003066 5e-04 0.9821
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2+ xx3))
Analysis of Variance Table

Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2 + xx3
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     98 58.893                           
2     96 58.879  2  0.013966 0.0114 0.9887

Validation, test data and cross-validation

The above metrics, although useful to varying degrees, do not directly address the question of model generalization. As implied in section 2, the only true and direct arbiter of the generalization performance of a model is its evaluation on a set of observations unseen during the training (fitting, in the case of lm) process.

Depending on whether this data set is used for model selection or strictly to provide an independent evaluation of one final model, this set of observations is referred to as a validation set or a test set, respectively.

Typically, for the purposes of model selection it is preferable to use multiple different splits of the data into training and validation sets in order to reduce the uncertainty of the generalization performance estimates. This procedure of repeatedly partitioning the data into complementary training and validation subsets is referred to as cross-validation.

There are multiple strategies for performing this split 2. Here we will use a type of partitioning known as k-fold cross-validation: the original sample is randomly partitioned into \(k\) equal subsamples. Of the \(k\) subsamples, a single one is retained as a validation set while the remaining \(k − 1\) subsamples are used as training data. The cross-validation process is then repeated \(k\) times, with each of the \(k\) subsamples used exactly once as validation data. The \(k\) results are finally averaged to produce a single estimate:

k-fold CV
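
Before turning to a dedicated package, here is a minimal sketch of how such a partitioning could be coded by hand in base R, using the x and y from the polynomial example (the choices of k = 5 and of the degree-2 model are purely illustrative):

set.seed(3)
k <- 5
folds <- sample(rep(1:k, length.out = length(y)))      # random fold assignment
cv_mse <- rep(0, k)
for (f in 1:k) {
  train <- data.frame(x = x[folds != f], y = y[folds != f])
  valid <- data.frame(x = x[folds == f], y = y[folds == f])
  m <- lm(y ~ poly(x, 2, raw = TRUE), data = train)    # fit on the k-1 training folds
  cv_mse[f] <- mean((predict(m, valid) - valid$y)^2)   # error on the held-out fold
}
mean(cv_mse)                                           # averaged cross-validation estimate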

Here we will use the function cvFit from the cvTools library in order to perform k-fold cross validation for our example:

library(cvTools)
DF <- data.frame(y = y, x = x)
CV.MSE_sqrt <- rep(0, 12)
for (np in 1:12) {
  fit <- lm(y ~ poly(x, np, raw = TRUE), data = DF)
  CV.MSE_sqrt[np] <- cvFit(fit, data = DF, y = DF$y, K = 10, R = 10)$cv
}

plot (1:12, CV.MSE_sqrt, col="skyblue4",  xlab="# of parameters" , type="l", lwd=4, main="Cross-validation error")

Avoiding overfitting

Overfitting is a symptom of two afflictions: An overly complex model and a small set of training observations.

This suggests two possible routes for curing overfitting:

  • Increase the training set and/or
  • favor simpler models
How much data is enough to avoid overfitting?

As a rule of thumb, one should have at least 20-fold more observations than predictors.
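
For example, under this rule of thumb the degree-15 polynomial (15 predictors) would call for roughly 15 × 20 = 300 observations, which is exactly the training-set size used in the example below.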

Increasing the training set

Increasing the training set is a trivial but very effective way of improving model performance.

Let’s see what the effect of a large training set is on our example of polynomial fits:

npoints <- 300
set.seed(1)
xL <- runif(npoints, 0, 11)
yL <- (2-xL) * (xL-9) + rnorm(npoints,0,2.5)
set.seed(2)
xL_val <- runif(npoints, 0, 11)
yL_val <- (2-xL_val) * (xL_val-9) + rnorm(npoints,0,2.5)

Val_MSE_sqrt <- rep(0, 15)
for (np in 1:15) {
  fit <- lm(yL ~ poly(xL, np, raw = TRUE))
  yL_val.pred <- predict(fit, data.frame(xL = xL_val))
  Val_MSE_sqrt[np] <- sqrt(sum((yL_val.pred - yL_val)^2))
}
par(mfrow=c(1,2))
plot(xL,yL,pch=19,col="skyblue4", main="Fit for 15th order polyn.") 
lines(seqx, predict(fit, data.frame(xL = seqx)), col="green", lwd=2)
plot (c(1:15), Val_MSE_sqrt, col="skyblue4",  xlab="# parameters" , type="l", lwd=4, main="Validation error")

Favoring simpler models

Another effective way of discouraging model-overfitting is opting for models of reduced complexity. Fewer free model-parameters typically tend to decrease the variance of the model and classical linear regression is no exception to this rule 3.

One option for reducing model complexity, and therefore overfitting potential, is to carefully select as predictors only features that have a strong relationship with the outcome or to use model-assessment metrics that take into consideration model complexity. For example, one possible strategy for model/feature selection is to discard features that do not improve the adj.r.squared value of our model or do not result in significantly better models when evaluated using ANOVA testing.

There also exist other model-assessment metrics (e.g. AIC, BIC) that are similar in spirit to adj.r.squared in that they take into consideration the number of model parameters in order to provide a better estimate of generalization performance 4.
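
For instance, base R provides the AIC() and BIC() functions; applied to the three polynomial fits from our earlier example (lower values are better), they penalize the additional parameters in much the same spirit as adj.r.squared:

AIC(fit1, fit2, fit3)   # Akaike information criterion for each candidate model
BIC(fit1, fit2, fit3)   # Bayesian information criterion (stronger penalty per parameter)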

However, it can sometimes be quite impractical or computationally prohibitive to evaluate the impact of all possible features and their combinations in terms of those metrics.

One alternative is to include all available features in our model and modify the objective of the fitting procedure by penalizing the number/magnitude of the coefficients in our model. This approach is known as regularization, and its incorporation in linear regression is referred to as regularized linear regression.

The most widely-used regularized linear regression variant is LASSO regression (also known as L1-regularized regression). Remember that in classical linear regression the fitting objective is to select coefficients \({\beta_i}\) that minimize the residual error:

\(\min_{\beta_i} \sum_{j}\epsilon_j^2 = \sum_{j}(y_j - \hat{y_j})^2\)

Instead in LASSO regression we want to minimize:

\(\min_{\beta_i}\{ \sum_{j=1}^{n} (y_j -\hat{y_j})^2 +\lambda \sum_{i=1}^{p} \vert \beta_i \vert \}\)

The second term in the above objective function is a coefficient penalty that can be modulated by adjusting the regularization parameter \(\lambda\). Higher values of \(\lambda\) result in sparser models with fewer non-zero coefficients \({\beta_i}\). When \(\lambda\) is set to zero we get back the classical linear regression fitting objective.

library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# Use cross-validation to select lambda (np is 15 here, carried over from the last loop above):
lambdas <- 10^seq(2, -5, by = -.1)
LASSO_reg <- cv.glmnet(poly(x, np, raw = TRUE), y, alpha = 1, nfolds = 10, lambda = lambdas)

par(mfrow=c(1,2))
plot (log10(LASSO_reg$lambda), sqrt(LASSO_reg$cvm), col="skyblue4" , type="l", lwd=4, main="CV error", xlab="Log lambda", ylab="MSE_sqrt")

lambda_best <- LASSO_reg$lambda.min 
#fit:
LASSO_fit <- glmnet(poly(x,np,raw=TRUE), y, alpha = 1, lambda = lambda_best)

# Check coefficients, predict and show fit:
LASSO_fit$beta
15 x 1 sparse Matrix of class "dgCMatrix"
              s0
1   8.043890e+00
2  -4.350079e-01
3  -3.225474e-02
4   .           
5   .           
6   .           
7   .           
8   .           
9   .           
10  .           
11  .           
12  .           
13  .           
14  .           
15  3.930175e-16
plot(x,y,pch=19,col="skyblue4", main="LASSO fit") 
lines(seqx, predict(LASSO_fit, newx= poly(seqx,np,raw=TRUE) ), col="Tomato", lwd=2) 

We can see that LASSO regression results in a model with very few non-zero coefficients, essentially identical to the second-degree polynomial model.

Exercise 2

In this exercise we will use once more the Iris dataset introduced in the previous lecture. First create a linear model that predicts Petal.Width using Sepal.Width as a predictor.

  • Does the addition of Species to Sepal.Width significantly improve the prediction of Petal.Width?
  • Does the addition of interaction terms significantly improve the prediction of Petal.Width?
# fit models
m1 <- lm(Petal.Width ~ Sepal.Width, data = iris)
m2 <- lm(Petal.Width ~ Sepal.Width + Species, data = iris)
m3 <- lm(Petal.Width ~ Sepal.Width * Species, data = iris)
# compare models
anova(m1, m2)
Analysis of Variance Table

Model 1: Petal.Width ~ Sepal.Width
Model 2: Petal.Width ~ Sepal.Width + Species
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    148 74.965                                  
2    146  4.794  2    70.172 1068.6 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2, m3)
Analysis of Variance Table

Model 1: Petal.Width ~ Sepal.Width + Species
Model 2: Petal.Width ~ Sepal.Width * Species
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    146 4.7935                                  
2    144 4.2135  2      0.58 9.9109 9.276e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 3

In this exercise we will use the melanoma dataset to practice some aspects of model building and model selection.

A. Load the dataset and split the observations into a training set containing 155 observations and a validation set containing the remaining 50 observations.

B. Construct a multilinear model for predicting melanoma thickness using 4 of the other variables (status, sex, age and ulcer) as individual predictors. Remember to use the TRAIN subset of the data for fitting.

Take a look at the model matrix used internally by lm.

Which of the four predictors is more strongly associated with thickness?

Which is the one that appears to be more weakly associated?

Visualize the predictions of your model against the actual thickness values for both the training and the validation data sets.

C. Now add interactions between predictors to your model. Remember that you can add interaction terms using either : (a specific interaction) or * (which enumerates all possible interaction terms in your model).

Start by adding only one interaction term between sex and ulcer.

Then define a model that contains all possible interaction terms.

Compare the three models (simple multilinear, multilinear + 1 interaction term, multilinear with all interaction terms) in terms of the r.squared, adj.r.squared and ANOVA testing results.

Finally, compare the three models in terms of the validation error. Use the residual sum of squares as a metric:

\(RSS=\sum_{j}(y_j - \hat{y_j})^2\)

Do you see an improvement when including the single interaction term?

How about when including all possible interaction terms?

If you see a deterioration in performance what do you think is the reason behind it?

mel <- read.delim("data/melanoma_data.txt")
summary(mel)
      time          status          sex              age       
 Min.   :  10   Min.   :1.00   Min.   :0.0000   Min.   : 4.00  
 1st Qu.:1525   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:42.00  
 Median :2005   Median :2.00   Median :0.0000   Median :54.00  
 Mean   :2153   Mean   :1.79   Mean   :0.3854   Mean   :52.46  
 3rd Qu.:3042   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:65.00  
 Max.   :5565   Max.   :3.00   Max.   :1.0000   Max.   :95.00  
      year        thickness         ulcer      
 Min.   :1962   Min.   : 0.10   Min.   :0.000  
 1st Qu.:1968   1st Qu.: 0.97   1st Qu.:0.000  
 Median :1970   Median : 1.94   Median :0.000  
 Mean   :1970   Mean   : 2.92   Mean   :0.439  
 3rd Qu.:1972   3rd Qu.: 3.56   3rd Qu.:1.000  
 Max.   :1977   Max.   :17.42   Max.   :1.000  
set.seed(5)
smpl <- sample(1:nrow(mel), 50)
TRAIN <- mel[-smpl,]
VALID  <- mel[smpl,]
head(model.matrix(~status+ sex + age + ulcer, data = TRAIN))
  (Intercept) status sex age ulcer
1           1      3   1  76     1
2           1      3   1  56     0
3           1      2   1  41     0
4           1      3   0  71     0
5           1      1   1  52     1
6           1      1   1  28     1
par(mfrow=c(1,2))
fit_multiv <- lm(thickness ~ status + sex + age + ulcer, data=TRAIN)
summary(fit_multiv)

Call:
lm(formula = thickness ~ status + sex + age + ulcer, data = TRAIN)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9298 -1.4596 -0.6726  0.4894 12.4579 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.66964    1.08405   2.463 0.014923 *  
status      -1.18990    0.42401  -2.806 0.005677 ** 
sex          0.76941    0.45745   1.682 0.094655 .  
age          0.02595    0.01350   1.922 0.056531 .  
ulcer        1.79550    0.48192   3.726 0.000275 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.752 on 150 degrees of freedom
Multiple R-squared:  0.2306,    Adjusted R-squared:  0.2101 
F-statistic: 11.24 on 4 and 150 DF,  p-value: 5.286e-08
pred_multiv <- predict(fit_multiv, TRAIN )
plot(TRAIN$thickness, pred_multiv, pch = "*", col="red")
pred_multiv_test <- predict(fit_multiv, VALID )
plot(VALID$thickness, pred_multiv_test, pch = "*", col="red")

par(mfrow=c(2,2))
fit_multiv_int1 <- lm(thickness ~ status + sex + age + ulcer + sex:ulcer, data=TRAIN)
summary(fit_multiv_int1)

Call:
lm(formula = thickness ~ status + sex + age + ulcer + sex:ulcer, 
    data = TRAIN)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0757 -1.4590 -0.6630  0.4787 12.6343 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  2.70252    1.08692   2.486  0.01401 * 
status      -1.16031    0.42684  -2.718  0.00734 **
sex          0.44883    0.64799   0.693  0.48960   
age          0.02622    0.01353   1.937  0.05460 . 
ulcer        1.53947    0.60576   2.541  0.01206 * 
sex:ulcer    0.64469    0.92139   0.700  0.48521   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.757 on 149 degrees of freedom
Multiple R-squared:  0.2331,    Adjusted R-squared:  0.2074 
F-statistic: 9.059 on 5 and 149 DF,  p-value: 1.536e-07
pred_multiv_int1 <- predict(fit_multiv_int1, TRAIN )
plot(TRAIN$thickness, pred_multiv_int1, pch = "*", col="red")
pred_multiv_valid_int1 <- predict(fit_multiv_int1, VALID )
plot(VALID$thickness, pred_multiv_valid_int1, pch = "*", col="red")

fit_multiv_intA <- lm(thickness ~ status * sex * age * ulcer, data=TRAIN)
summary(fit_multiv_intA)

Call:
lm(formula = thickness ~ status * sex * age * ulcer, data = TRAIN)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7433 -1.3598 -0.5319  0.5834 11.3838 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)   
(Intercept)            8.60745    6.17420   1.394  0.16551   
status                -4.07618    3.24814  -1.255  0.21161   
sex                  -15.50559   10.48788  -1.478  0.14156   
age                   -0.10457    0.10424  -1.003  0.31753   
ulcer                -16.11057    8.04644  -2.002  0.04721 * 
status:sex             8.13138    5.52054   1.473  0.14303   
status:age             0.06471    0.05469   1.183  0.23874   
sex:age                0.33742    0.18174   1.857  0.06549 . 
status:ulcer           9.24986    4.40293   2.101  0.03746 * 
sex:ulcer             26.53865   12.38508   2.143  0.03387 * 
age:ulcer              0.35401    0.13450   2.632  0.00945 **
status:sex:age        -0.17161    0.09619  -1.784  0.07660 . 
status:sex:ulcer     -12.56049    6.82652  -1.840  0.06791 . 
status:age:ulcer      -0.18545    0.07308  -2.538  0.01226 * 
sex:age:ulcer         -0.54850    0.21219  -2.585  0.01077 * 
status:sex:age:ulcer   0.27148    0.11612   2.338  0.02082 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.721 on 139 degrees of freedom
Multiple R-squared:  0.3032,    Adjusted R-squared:  0.228 
F-statistic: 4.032 on 15 and 139 DF,  p-value: 4.724e-06
pred_multiv_intA <- predict(fit_multiv_intA, TRAIN )
plot(TRAIN$thickness, pred_multiv_intA, pch = "*", col="red")
pred_multiv_valid_intA <- predict(fit_multiv_intA, VALID )
plot(VALID$thickness, pred_multiv_valid_intA, pch = "*", col="red")

anova(fit_multiv, fit_multiv_int1)
Analysis of Variance Table

Model 1: thickness ~ status + sex + age + ulcer
Model 2: thickness ~ status + sex + age + ulcer + sex:ulcer
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    150 1136.3                           
2    149 1132.6  1    3.7214 0.4896 0.4852
anova(fit_multiv_int1, fit_multiv_intA)
Analysis of Variance Table

Model 1: thickness ~ status + sex + age + ulcer + sex:ulcer
Model 2: thickness ~ status * sex * age * ulcer
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    149 1132.6                           
2    139 1029.1 10    103.48 1.3977 0.1874
par(mfrow=c(1,1))
MSE_train_multiv <- mean(residuals(fit_multiv)^2)
MSE_valid_multiv <- mean((pred_multiv_test - VALID$thickness)^2)

MSE_train_multiv_int1 <- mean(residuals(fit_multiv_int1)^2)
MSE_valid_multiv_int1 <- mean((pred_multiv_valid_int1 - VALID$thickness)^2)


MSE_train_multiv_intA <- mean(residuals(fit_multiv_intA)^2)
MSE_valid_multiv_intA <- mean((pred_multiv_valid_intA - VALID$thickness)^2)

MSEs <- rbind( c(MSE_train_multiv, MSE_train_multiv_int1, MSE_train_multiv_intA),c(MSE_valid_multiv,MSE_valid_multiv_int1,MSE_valid_multiv_intA) )
barplot(MSEs,beside=TRUE,ylim=c(0,10), names.arg = c("TR_M","TS_M","TR_MI1","TS_MI1","TR_MIA","TS_MIA"),las=2, ylab="MSE")

BONUS: Lasso regression with interaction terms

The code below illustrates how to implement LASSO regression that includes interaction terms as predictors, using the melanoma dataset as an example:

lambdas <- 10^seq(2, -5, by = -.1)
f <- as.formula(thickness ~ status * sex * age * ulcer+0)

x <- model.matrix(f, TRAIN)
y <- as.matrix(TRAIN$thickness, ncol=1)

LASSO_regr <- cv.glmnet(x, y, alpha=1, nfolds=10, lambda=lambdas)


lambda_best <- LASSO_regr$lambda.min 

g <- glmnet(x, y, lambda=lambda_best)
print(coef(g))
16 x 1 sparse Matrix of class "dgCMatrix"
                               s0
(Intercept)           3.347973929
status               -1.211868330
sex                   0.310903562
age                   0.017431402
ulcer                 0.945514137
status:sex            .          
status:age            .          
sex:age               .          
status:ulcer          .          
sex:ulcer             .          
age:ulcer             0.008699109
status:sex:age        .          
status:sex:ulcer      0.469772793
status:age:ulcer      .          
sex:age:ulcer         .          
status:sex:age:ulcer  .          

Footnotes

  1. However, consider what happens to the value of adj.r.squared when a model nearly perfectly interpolates the data. Can it still be trusted as a metric for these tasks?↩︎

  2. For more information on cross-validation check the wikipedia article: cross validation.↩︎

  3. There exist some notable exceptions where extremely over-parameterized (\(p>n\)) models combined with specific model-fitting procedures exhibit the counter-intuitive behavior of decreasing in variance as parameters increase. For those interested in this fascinating phenomenon:
    Bias-variance tradeoff: A modern perspective
    Benign overfitting↩︎

  4. AIC/BIC and model selection
    R implementation of AIC/BIC↩︎