set.seed(1)
npoints <- 25
x <- runif(npoints, 0, 11)
y <- (2-x) * (x-9) + rnorm(npoints,0,2.5)
plot(x,y,pch=19,col="skyblue4")
Model selection
Recap on linear regression
What is linear regression
- Generally, a model formalizes the relationship between variables: \(y = f(x_i)\)
  Notation: \(y\) and \(x_i\) (\(x_1\), \(x_2\), …) are random variables
- By defining \(f(\ldots)\) we formalize an assumption about the type of relationship
- linear models model or predict a variable (\(y\)) as a linear combination of one or several explanatory variables (\(x_i\)):
\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n\)
- \(y\) is the response variable or dependent variable (it depends on the \(x_i\))
- \(x_i\) are the independent variables or predictors (they allow prediction of \(y\))
- \(\beta_i\) are the parameters to fit (coefficients).
In the simple case of \(y = \beta_0 + \beta_1x_1\), \(\beta_0\) is the intercept and \(\beta_1\) the slope.
- The relationship between \(y\) and the \(x_i\) can be formally written as:
\(y = \beta_0 + \sum_{i}{\beta_i x_i} + \epsilon\)
or in matrix notation:
\(y = \beta_0 + \beta \cdot X + \epsilon\)
- \(\epsilon\) contains the errors (e.g. unknown influences, measurement errors) or residuals, often assumed to follow a normal distribution with mean zero: \(\epsilon \sim \mathcal{N}(0,\sigma^2)\)
- \(\epsilon\) is simply the difference between the observed values \(y\) and the corresponding predicted values \(\hat{y} = \beta_0 + \sum_{i}{\beta_i x_i}\):
\(\epsilon = y - \hat{y}\)
- “Fitting a linear model” means finding values for \(\beta_i\) that minimize the residuals, for example the sum of their squares
\(\min_{\beta_i} \sum_{j}\epsilon_j^2 = \min_{\beta_i} \sum_{j}(y_j - \hat{y_j})^2\)
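To make this concrete, here is a small sketch (using the x and y simulated at the top of these notes) that solves the least-squares problem for a single predictor directly via the normal equations:
# Least-squares fit "by hand" for y = b0 + b1*x, using the normal equations
Xmat <- cbind(1, x)                        # design matrix: intercept column plus x
beta_hat <- solve(t(Xmat) %*% Xmat, t(Xmat) %*% y)
beta_hat                                   # the b0 and b1 that minimize the sum of squared residuals
sum((y - Xmat %*% beta_hat)^2)             # the minimized residual sum of squares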
Linear regression can be used for:
- Predictive modeling, that is, training a model on observed data and using it to predict unobserved \(y\) from new values of \(x_i\).
- Studying the association of \(x_i\) with \(y\) (statistical modeling): given \(y\) and several \(x_1, x_2, \ldots, x_n\) that may be related to \(y\), (linear) models can be used to quantify the strength of the association between each \(x_i\) and \(y\). An \(x_i\) that is strongly associated with \(y\) is a strong predictor of \(y\), and the predictive performance of a model that includes \(x_i\) will be significantly better than that of the model without \(x_i\) as a predictor.
Creating and interpreting linear models in R
In R, linear models are created using lm(). For a model \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\), the formula is of the form y ~ x1 + x2.
- Multiple regressors are combined using +.
- The intercept (constant term) \(\beta_0\) is implicitly included, unless you write y ~ x1 + x2 + 0.
- Interactions between predictors can be added to the model using either : or *, which enumerates all possible interaction terms in your model. E.g.:
fit <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, data = DF)
or simply:
fit <- lm(y ~ x1 * x2 * x3, data = DF)
lm() returns an R object of class lm, for which there are a number of methods available:
- coefficients(fit) gives you back the \(\beta_i\) (e.g. slope and intercept terms in the case of a single predictor).
- residuals(fit) gives you back the \(\epsilon\) values. You could for example calculate the residual sum of squares using sum(residuals(fit)^2).
- predict(fit) gives you back the fitted values for the response (the \(\hat{y}\)). You can also use this function to predict the response for new observations \(x_i\), using predict(fit, newdata=data.frame).
- summary(fit) gives a great summary of the overall results of the regression.
- The lm object has its own plot method, which produces plots to assess the quality of a linear model.
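As a quick illustration of these methods (a sketch using the x and y simulated at the top of these notes; the model itself is just a placeholder):
# Fit a simple linear model and query it with the accessor functions listed above
fit_ex <- lm(y ~ x)
coefficients(fit_ex)                                # intercept and slope estimates
sum(residuals(fit_ex)^2)                            # residual sum of squares
head(predict(fit_ex))                               # fitted values (y-hat) for the training points
predict(fit_ex, newdata = data.frame(x = c(3, 6)))  # predictions for new x values
summary(fit_ex)                                     # overall summary of the regression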
Principles of model selection
Model selection refers to the task of selecting a good / the best model among a set of candidate models. Model selection is a critical part of an analysis since a flawed model will lead to incorrect predictions and conclusions.
But what exactly do we mean by the “goodness” of a model? Informally, our model needs to be flexible enough to capture the underlying relationship between predictors and response, but at the same time it should not introduce unnecessary complications.
This is in essence the principle also known as Occam’s razor, which states that among candidate models of similar explanatory power, preference should be given to the simpler one.
Let’s discuss this guiding principle while looking at a motivating example. We want to model the relationship between x and y given the data simulated above:
Let’s first try to fit a model to this data using a univariate linear model:
fit1 <- lm(y ~ x)
seqx <- seq(0, 11, length.out = 128)
plot(x,y,pch=19,col="skyblue4")
lines(seqx, predict(fit1, data.frame(x = seqx)), col="darkred", lwd=2)
Obviously this model is a poor fit to the data. Assuming a linear relationship is too simplistic and therefore our model lacks the ability to capture the underlying relationship between response and predictors.
This is an example of underfitting.
The error arising as a result of overly simplistic or more generally of incorrect modelling assumptions is called bias.
Now let’s add to our model predictors that are different polynomial degrees of x. Each added polynomial term is an extra explanatory variable that gives our model increasing modelling capacity.
We can specify to lm a second-degree polynomial (\(y = \alpha + \beta_1 x + \beta_2 x^2\)) as lm(y ~ x + I(x^2)) or lm(y ~ poly(x, 2, raw=TRUE)), and in general a polynomial of degree n (\(y = \alpha + \beta_1 x + \beta_2 x^2 + \ldots + \beta_n x^n\)) as lm(y ~ x + I(x^2) + ... + I(x^n)) or lm(y ~ poly(x, n, raw=TRUE)).
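As a quick check (a small sketch on our simulated data), the two specifications yield the same coefficient estimates; only the term names differ:
# I(x^2) and poly(..., raw=TRUE) parameterizations are equivalent
coef(lm(y ~ x + I(x^2)))
coef(lm(y ~ poly(x, 2, raw = TRUE)))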
Let’s look at the linear model that tries to model the response using the first two polynomial degrees of x:
# A model of the form y = a0 + a1*x + a2*x^2 :
fit2 <- lm(y ~ poly(x,2, raw=TRUE))
plot(x,y,pch=19,col="skyblue4")
lines(seqx, predict(fit2, data.frame(x = seqx)), col="darkorange", lwd=2)
This instinctively looks like a reasonable model choice for our observations. It seems to capture the regularities of the process that gave rise to the data while at the same time being a relatively simple model that requires only three coefficients (an intercept and two polynomial terms) to be fully specified.
# A model of the form y = a0 + a1*x + a2*x^2 + ... + a15*x^15 :
fit3 <- lm(y ~ poly(x,15,raw=TRUE))
plot(x,y,pch=19,col="skyblue4")
lines(seqx, predict(fit3, data.frame(x = seqx)), col="green", lwd=2)
This does not look like a good choice. Our model gives us a curve that matches very closely (interpolates) the observations but it does so at the cost of being overly sensitive to small fluctuations of the predictors.
The error arising from high sensitivity to the inputs is referred to as variance.
High-variance models capture inconsequential details of the observations that are probably the result of noise and therefore unique to the training data, and unlikely to be present in a new set of observations generated by the same process. As a result they miss the “big picture”, lowering their performance on new data.
We say that such models overfit the training data and do not generalize well on new observations.
Let’s see in practice how our three models perform on a new set of observations:
set.seed(8)
newx <- runif(npoints, 0, 11)
newy <- (2-newx) * (newx-9) + rnorm(npoints,0,2.5)

plot(newx,newy,pch=19,col="skyblue4")
lines(seqx, predict(fit1, data.frame(x = seqx)), col="red", lwd=2)
lines(seqx, predict(fit2, data.frame(x = seqx)), col="darkorange", lwd=2)
lines(seqx, predict(fit3, data.frame(x = seqx)), col="green", lwd=2)
We can now formalize a bit more our intuition of what constitutes “bad” and “good” models:
- The sources of error that depend on model selection can be decomposed into bias and variance.
- Our model selection should control for both types of error in order to strike a good balance between modelling capacity and the potential to overfit.
- A low-bias, low-variance model fits the training data well but also performs well on unseen data, that is, it is able to generalize.
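To attach numbers to this picture, here is a small sketch computing the residual sum of squares of each fit on the new observations generated above (the quadratic model should come out best):
# RSS of fit1, fit2 and fit3 on the held-out sample (newx, newy)
sapply(list(linear = fit1, quadratic = fit2, degree15 = fit3), function(f)
  sum((newy - predict(f, data.frame(x = newx)))^2))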
Comparing linear models and assessing the value of predictors
Residuals, r.squared and adj.r.squared
In most classes of models - including the case of linear regression - the addition of model parameters, either in the form of more predictors or more interaction terms, reduces bias and tends to improve the fit to the training data. This is reflected in a reduction of the residuals and an inflation of the r.squared value. This reduction in bias also means that the modelling capacity grows (i.e. its variance increases), making our model more prone to overfitting and loss of generalization.
For example for the three linear models discussed above:
summary(fit1)$r.squared
[1] 0.1039079
summary(fit2)$r.squared
[1] 0.9648976
summary(fit3)$r.squared
[1] 0.9859424
sum(residuals(fit1)^2)
[1] 1702.277
sum(residuals(fit2)^2)
[1] 66.68294
sum(residuals(fit3)^2)
[1] 26.70486
Features that result in imperceptible improvements in terms of these metrics are unlikely to boost model predictive performance. However, looking at the residuals and r.squared values is not in itself enough to assess the quality of our model: they evaluate only one aspect of it (the fit to the training data) and tell us little about its ability to generalize. These metrics do not account in any way for the increase in model complexity and variance resulting from additional explanatory variables.
One measure that attempts to account for the spurious increase in r.squared when extra explanatory variables are introduced to the model is the adj.r.squared.
adj.r.squared is a corrected r.squared value that adjusts for the number of explanatory variables (\(p\)) in a model relative to the number of data points (\(n\)):
\(R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-p-1}\)
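As a quick sanity check (a small sketch using fit2 from above, with \(n = 25\) observations and \(p = 2\) predictors), the formula reproduces the value reported by summary():
# Manual adjusted R-squared for fit2 versus the value computed by summary()
n <- length(y)                           # 25 observations
p <- 2                                   # two polynomial predictors
R2 <- summary(fit2)$r.squared
1 - (1 - R2) * (n - 1) / (n - p - 1)     # formula above
summary(fit2)$adj.r.squared              # should agree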
adj.r.squared is therefore a more informative metric for the tasks of:
- Feature selection: features that fail to substantially increase, or even decrease, the value of adj.r.squared do not contribute to the model’s predictive power and we should consider removing them.
- Comparing models of the same response with different numbers of predictors. [1]
Let’s take a look at how the residual sum of squares (RSS), r.squared and adj.r.squared change as a function of the number of model parameters in our example:
Rsq <- rep(0,15)
Rsq.adj <- rep(0,15)
Res <- rep(0,15)
SUMMARY <- data.frame(
  Rsquared = rep(0,15),
  Rsquared.adjusted = rep(0,15),
  RSS = rep(0,15)
)
for (np in 1:15){
  fit <- lm(y ~ poly(x,np,raw=TRUE))
  Rsq[np] <- summary(fit)$r.squared
  Rsq.adj[np] <- summary(fit)$adj.r.squared
  Res[np] <- sum(residuals(fit)^2)
}
par(mfrow=c(1,2))
plot (1:15, Rsq, col="skyblue4", xlab="# parameters" , ylim=c(0,1), type="l", lwd=4)
lines (1:15, Rsq.adj, col="darkred", lwd=4)
legend(10,0.5,c("unadjust.","adjusted"),pch=19,col=c("skyblue4","darkred") )
plot (1:15, Res, col="skyblue4", lwd=4, ylab="RSS", xlab="# parameters", type="l")
We see that adding polynomial degrees beyond 2 only marginally increases the r.squared values, while it does not improve, or even decreases, the adj.r.squared values, suggesting that these terms do not contribute to predictive power and probably hurt generalization performance.
Assessing models and relevance of predictors using analysis of variance (ANOVA)
Is there a more principled way to assess whether specific predictors are beneficial to our model, or more generally, to compare models?
One can use the framework of Hypothesis testing we introduced in a previous lecture in order to test whether the difference in the residual sum of squares (RSS) resulting from additional features is significant enough to justify their inclusion in the model.
More generally, we can evaluate the null hypothesis (H0) that the difference in RSS between two models is not significant, while accounting for the (possibly) different residual degrees of freedom between the models.
Low P-values indicate rejection of H0 and favor the model with the lower RSS.
This type of analysis is an example of a group of statistical hypothesis testing procedures collectively known as ANalysis Of VAriance (ANOVA). ANOVA tests evaluate the impact of one or more factors (e.g. a categorical variable) by decomposing the total variance of a variable into variation originating from the factor(s) of interest and the rest. (In fact, the t-test we introduced in earlier lectures is another special case of an ANOVA test.)
When ANOVA is applied to model comparison, the factor of interest is the model class and the variable whose variation we evaluate is the RSS.
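To connect this to the anova() output shown below: the F statistic compares the drop in RSS per extra degree of freedom to the residual variance of the larger model. A small sketch computing it by hand for fit1 versus fit2:
# F statistic and p-value for comparing fit1 (linear) against fit2 (quadratic) "by hand"
rss1 <- sum(residuals(fit1)^2); df1 <- df.residual(fit1)
rss2 <- sum(residuals(fit2)^2); df2 <- df.residual(fit2)
Fstat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
Fstat                                          # matches the F value reported by anova(fit1, fit2)
pf(Fstat, df1 - df2, df2, lower.tail = FALSE)  # corresponding p-value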
anova(fit1 , fit2)
Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 23 1702.28
2 22 66.68 1 1635.6 539.61 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit2 , fit3)
Analysis of Variance Table
Model 1: y ~ poly(x, 2, raw = TRUE)
Model 2: y ~ poly(x, 15, raw = TRUE)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 22 66.683
2 11 26.705 11 39.978 1.497 0.2572
anova(fit1, fit2 , fit3)
Analysis of Variance Table
Model 1: y ~ x
Model 2: y ~ poly(x, 2, raw = TRUE)
Model 3: y ~ poly(x, 15, raw = TRUE)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 23 1702.28
2 22 66.68 1 1635.59 673.718 3.211e-11 ***
3 11 26.70 11 39.98 1.497 0.2572
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Exercise 1. (Rsquared, adj.Rsquared, ANOVA)
Consider a response variable y and predictor variables x1, x2, x3 that are generated as follows:
yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- rnorm(100)
xx3 <- rnorm(100)
A. Evaluate the models that use only x1, x1+x2, or x1+x2+x3 to predict y, using the r.squared and adj.r.squared metrics as well as ANOVA testing.
B. Repeat, but this time using variables that are generated as follows:
yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- 2*xx1 + rnorm(100)
xx3 <- -xx1 + rnorm(100)
# simulate data (only x1 is truly correlated to y)
set.seed(1)
yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- rnorm(100)
xx3 <- rnorm(100)
summary(lm(yy ~ xx1))$r.squared
[1] 0.4673515
summary(lm(yy ~ xx1 + xx2))$r.squared
[1] 0.4685423
summary(lm(yy ~ xx1 + xx2 + xx3))$r.squared
[1] 0.4685444
# adjusted r-squared instead of r-squared
summary(lm(yy ~ xx1))$adj.r.squared
[1] 0.4619164
summary(lm(yy ~ xx1 + xx2))$adj.r.squared
[1] 0.4575844
summary(lm(yy ~ xx1 + xx2 + xx3))$adj.r.squared
[1] 0.4519364
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2))
Analysis of Variance Table
Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 42.542
2 97 42.447 1 0.095103 0.2173 0.6421
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2+ xx3))
Analysis of Variance Table
Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2 + xx3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 42.542
2 96 42.447 2 0.095275 0.1077 0.898
# simulate data (x1, x2, x3 are collinear)
set.seed(2)
yy <- rnorm(100)
xx1 <- yy + rnorm(100)
xx2 <- 2*xx1+rnorm(100)
xx3 <- -xx1+rnorm(100)
summary(lm(yy ~ xx1))$r.squared
[1] 0.5580554
summary(lm(yy ~ xx1 + xx2))$r.squared
[1] 0.5580577
summary(lm(yy ~ xx1 + xx2 + xx3))$r.squared
[1] 0.5581602
# adjusted r-squared instead of r-squared
summary(lm(yy ~ xx1))$adj.r.squared
[1] 0.5535458
summary(lm(yy ~ xx1 + xx2))$adj.r.squared
[1] 0.5489455
summary(lm(yy ~ xx1 + xx2 + xx3))$adj.r.squared
[1] 0.5443527
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2))
Analysis of Variance Table
Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 58.893
2 97 58.892 1 0.0003066 5e-04 0.9821
anova(lm(yy ~ xx1), lm(yy ~ xx1 + xx2+ xx3))
Analysis of Variance Table
Model 1: yy ~ xx1
Model 2: yy ~ xx1 + xx2 + xx3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 98 58.893
2 96 58.879 2 0.013966 0.0114 0.9887
Validation, test data and cross-validation
The above metrics, although useful to various degrees, do not directly address the point of model generalization. As implied in section 2, the only true and direct arbiter of the generalization performance of a model is its evaluation on a new set of observations unseen during the training (fitting, in the case of lm) process.
Depending on whether this data set is used for model selection or strictly to provide an independent evaluation of one final model, this set of observations is referred to as the validation set or the test set, respectively.
Typically, for purposes of model selection it is preferable to use multiple different splits of the data into training and validation sets in order to reduce the uncertainty of the generalization performance estimates. This procedure of repeatedly partitioning the data into complementary training and validation subsets is referred to as cross-validation.
There are multiple strategies for performing this split [2]. Here we will use a type of partitioning known as k-fold cross-validation: the original sample is randomly partitioned into \(k\) equal subsamples. Of the \(k\) subsamples, a single one is retained as a validation set while the remaining \(k − 1\) subsamples are used as training data. The cross-validation process is then repeated \(k\) times, with each of the \(k\) subsamples used exactly once as validation data. The \(k\) results are finally averaged to produce a single estimate:
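To make the mechanics concrete, here is a minimal hand-rolled sketch of 10-fold cross-validation for the degree-2 polynomial model on our data (the cvFit call below automates this, including repeated runs):
# Hand-rolled 10-fold cross-validation for the quadratic model (illustrative only)
set.seed(3)
k <- 10
folds <- sample(rep(1:k, length.out = length(y)))   # randomly assign each point to a fold
cv_mse <- sapply(1:k, function(i) {
  train <- data.frame(x = x[folds != i], y = y[folds != i])
  valid <- data.frame(x = x[folds == i], y = y[folds == i])
  f <- lm(y ~ poly(x, 2, raw = TRUE), data = train)
  mean((valid$y - predict(f, valid))^2)             # mean squared error on the held-out fold
})
sqrt(mean(cv_mse))                                  # root of the average CV error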
Here we will use the function cvFit from the cvTools library in order to perform k-fold cross-validation for our example:
library(cvTools)
DF <- data.frame(y=y, x=x)
CV.MSE_sqrt <- rep(0,12)
for (np in 1:12){
  fit <- lm(y ~ poly(x,np,raw=TRUE), data=DF)
  CV.MSE_sqrt[np] <- cvFit(fit, data=DF, y=DF$y, K=10, R=10)$cv
}
plot (1:12, CV.MSE_sqrt, col="skyblue4", xlab="# of parameters" , type="l", lwd=4, main="Cross-validation error")
Avoiding overfitting
Overfitting is a symptom of two afflictions: an overly complex model and a small set of training observations.
This suggests two possible routes for curing overfitting:
- increase the training set, and/or
- favor simpler models.
As a rule of thumb, one should have at least 20-fold more observations than predictors.
Increasing the training set
Increasing the training set is a trivial but very effective way of improving model performance.
Let’s see what the effect of a large training set is on our example of polynomial fits:
npoints <- 300
set.seed(1)
xL <- runif(npoints, 0, 11)
yL <- (2-xL) * (xL-9) + rnorm(npoints,0,2.5)
set.seed(2)
xL_val <- runif(npoints, 0, 11)
yL_val <- (2-xL_val) * (xL_val-9) + rnorm(npoints,0,2.5)

Val_MSE_sqrt <- rep(0,15)
for (np in 1:15){
  fit <- lm(yL ~ poly(xL,np,raw=TRUE))
  yL_val.pred <- predict(fit, data.frame(xL=xL_val))
  Val_MSE_sqrt[np] <- sqrt(sum((yL_val.pred-yL_val)^2))
}
par(mfrow=c(1,2))
plot(xL,yL,pch=19,col="skyblue4", main="Fit for 15th order polyn.")
lines(seqx, predict(fit, data.frame(xL = seqx)), col="green", lwd=2)
plot (c(1:15), Val_MSE_sqrt, col="skyblue4", xlab="# parameters" , type="l", lwd=4, main="Validation error")
Favoring simpler models
Another effective way of discouraging model overfitting is opting for models of reduced complexity. Fewer free model parameters typically tend to decrease the variance of the model, and classical linear regression is no exception to this rule [3].
One option for reducing model complexity, and therefore overfitting potential, is to carefully select as predictors only features that have a strong relationship with the outcome, or to use model-assessment metrics that take model complexity into consideration. For example, one possible strategy for model/feature selection is to discard features that do not improve the adj.r.squared value of our model or do not result in significantly better models when evaluated using ANOVA testing.
There also exist other model-assessment metrics (e.g. AIC, BIC) that are similar in spirit to adj.r.squared in that they take into consideration the number of model parameters in order to provide a better estimate of generalization performance [4].
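For instance, since fit1, fit2 and fit3 are still available, base R's AIC() and BIC() can be compared directly (lower values indicate a better trade-off between fit and complexity):
# Information criteria for the three polynomial fits (lower is better)
AIC(fit1, fit2, fit3)
BIC(fit1, fit2, fit3)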
However, it can sometimes be quite impractical or computationally prohibitive to evaluate the impact of all possible features and their combinations in terms of these metrics.
One alternative is to include all available features in our model and modify the objective of the fitting procedure by penalizing the number / magnitude of the coefficients in our model. This approach is known as regularization and its incorporation into linear regression is referred to as regularized linear regression.
The most widely used regularized linear regression variant is LASSO regression (also known as L1-regularized regression). Remember that in classical linear regression the fitting objective is to select coefficients \({\beta_i}\) that minimize the residual error:
\(\min_{\beta_i} \sum_{j}\epsilon_j^2 = \min_{\beta_i} \sum_{j}(y_j - \hat{y_j})^2\)
Instead in LASSO regression we want to minimize:
\(\min_{\beta_i}\{ \sum_{j=1}^{n} (y_j -\hat{y_j})^2 +\lambda \sum_{i=1}^{p} \vert \beta_i \vert \}\)
The second term in the above objective function is a coefficient penalty that can be modulated by adjusting the regularization parameter \(\lambda\). Higher values of \(\lambda\) result in sparser models with fewer non-zero coefficients \({\beta_i}\). When \(\lambda\) is set to zero we get back the classical linear regression fitting objective.
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# Use cross-validation to select lambda:
lambdas <- 10^seq(2, -5, by = -.1)
LASSO_reg <- cv.glmnet(poly(x,np,raw=TRUE), y, alpha=1, nfolds=10, lambda=lambdas)
par(mfrow=c(1,2))
plot (log10(LASSO_reg$lambda), sqrt(LASSO_reg$cvm), col="skyblue4" , type="l", lwd=4, main="CV error", xlab="Log lambda", ylab="MSE_sqrt")
lambda_best <- LASSO_reg$lambda.min
# fit:
LASSO_fit <- glmnet(poly(x,np,raw=TRUE), y, alpha = 1, lambda = lambda_best)

# Check coefficients, predict and show fit:
LASSO_fit$beta
15 x 1 sparse Matrix of class "dgCMatrix"
s0
1 8.043890e+00
2 -4.350079e-01
3 -3.225474e-02
4 .
5 .
6 .
7 .
8 .
9 .
10 .
11 .
12 .
13 .
14 .
15 3.930175e-16
plot(x,y,pch=19,col="skyblue4", main="LASSO fit")
lines(seqx, predict(LASSO_fit, newx= poly(seqx,np,raw=TRUE) ), col="Tomato", lwd=2)
We can see that LASSO regression results in a model with very few non-zero coefficients, a model essentially identical to the second-degree polynomial model.
Exercise 2
In this exercise we will use once more the Iris dataset introduced in the previous lecture. First create a linear model that predicts Petal.Width using Sepal.Width as a predictor.
- Does the addition of Species to Sepal.Width significantly improve the prediction of Petal.Width?
- Does the addition of interaction terms significantly improve the prediction of Petal.Width?
# fit models
m1 <- lm(Petal.Width ~ Sepal.Width, data = iris)
m2 <- lm(Petal.Width ~ Sepal.Width + Species, data = iris)
m3 <- lm(Petal.Width ~ Sepal.Width * Species, data = iris)
# compare models
anova(m1, m2)
Analysis of Variance Table
Model 1: Petal.Width ~ Sepal.Width
Model 2: Petal.Width ~ Sepal.Width + Species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 148 74.965
2 146 4.794 2 70.172 1068.6 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m2, m3)
Analysis of Variance Table
Model 1: Petal.Width ~ Sepal.Width + Species
Model 2: Petal.Width ~ Sepal.Width * Species
Res.Df RSS Df Sum of Sq F Pr(>F)
1 146 4.7935
2 144 4.2135 2 0.58 9.9109 9.276e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Exercise 3
In this exercise we will use the melanoma dataset to practice some aspects of model building and model selection.
A. Load the dataset and split the observations into a training set containing 155 observations and a validation set containing the remaining 50 observations.
B. Construct a multilinear model for predicting melanoma thickness using the four variables status, sex, age and ulcer as individual predictors. Remember to use the TRAIN subset of the data for fitting.
Take a look at the model matrix used internally by lm.
Which of the four predictors is more strongly associated with thickness?
Which is the one that appears to be more weakly associated?
Visualize the predictions of your model against the actual thickness values for both the training and the validation data sets.
C. Now add interactions between predictors to your model. Remember that you can add interaction terms using either : or *, which enumerates all possible interaction terms in your model.
Start by adding only one interaction term, between sex and ulcer.
Then define a model that contains all possible interaction terms.
Compare the three models (simple multilinear, multilinear + 1 interaction term, multilinear with all interaction terms) in terms of the r.squared, adj.r.squared and ANOVA testing results.
Finally, compare the three models in terms of the validation error. Use the residual sum of squares as a metric:
\(RSS=\sum_{j}(y_j - \hat{y_j})^2\)
Do you see an improvement when including the single interaction term?
How about when including all possible interaction terms?
If you see a deterioration in performance what do you think is the reason behind it?
mel <- read.delim("data/melanoma_data.txt")
summary(mel)
time status sex age
Min. : 10 Min. :1.00 Min. :0.0000 Min. : 4.00
1st Qu.:1525 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:42.00
Median :2005 Median :2.00 Median :0.0000 Median :54.00
Mean :2153 Mean :1.79 Mean :0.3854 Mean :52.46
3rd Qu.:3042 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:65.00
Max. :5565 Max. :3.00 Max. :1.0000 Max. :95.00
year thickness ulcer
Min. :1962 Min. : 0.10 Min. :0.000
1st Qu.:1968 1st Qu.: 0.97 1st Qu.:0.000
Median :1970 Median : 1.94 Median :0.000
Mean :1970 Mean : 2.92 Mean :0.439
3rd Qu.:1972 3rd Qu.: 3.56 3rd Qu.:1.000
Max. :1977 Max. :17.42 Max. :1.000
set.seed(5)
smpl <- sample(1:nrow(mel), 50)
TRAIN <- mel[-smpl,]
VALID <- mel[smpl,]
head(model.matrix(~status+ sex + age + ulcer, data = TRAIN))
(Intercept) status sex age ulcer
1 1 3 1 76 1
2 1 3 1 56 0
3 1 2 1 41 0
4 1 3 0 71 0
5 1 1 1 52 1
6 1 1 1 28 1
par(mfrow=c(1,2))
fit_multiv <- lm(thickness ~ status + sex + age + ulcer, data=TRAIN)
summary(fit_multiv)
Call:
lm(formula = thickness ~ status + sex + age + ulcer, data = TRAIN)
Residuals:
Min 1Q Median 3Q Max
-3.9298 -1.4596 -0.6726 0.4894 12.4579
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.66964 1.08405 2.463 0.014923 *
status -1.18990 0.42401 -2.806 0.005677 **
sex 0.76941 0.45745 1.682 0.094655 .
age 0.02595 0.01350 1.922 0.056531 .
ulcer 1.79550 0.48192 3.726 0.000275 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.752 on 150 degrees of freedom
Multiple R-squared: 0.2306, Adjusted R-squared: 0.2101
F-statistic: 11.24 on 4 and 150 DF, p-value: 5.286e-08
pred_multiv <- predict(fit_multiv, TRAIN)
plot(TRAIN$thickness, pred_multiv, pch = "*", col="red")
pred_multiv_test <- predict(fit_multiv, VALID)
plot(VALID$thickness, pred_multiv_test, pch = "*", col="red")
par(mfrow=c(2,2))
fit_multiv_int1 <- lm(thickness ~ status + sex + age + ulcer + sex:ulcer, data=TRAIN)
summary(fit_multiv_int1)
Call:
lm(formula = thickness ~ status + sex + age + ulcer + sex:ulcer,
data = TRAIN)
Residuals:
Min 1Q Median 3Q Max
-4.0757 -1.4590 -0.6630 0.4787 12.6343
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.70252 1.08692 2.486 0.01401 *
status -1.16031 0.42684 -2.718 0.00734 **
sex 0.44883 0.64799 0.693 0.48960
age 0.02622 0.01353 1.937 0.05460 .
ulcer 1.53947 0.60576 2.541 0.01206 *
sex:ulcer 0.64469 0.92139 0.700 0.48521
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.757 on 149 degrees of freedom
Multiple R-squared: 0.2331, Adjusted R-squared: 0.2074
F-statistic: 9.059 on 5 and 149 DF, p-value: 1.536e-07
pred_multiv_int1 <- predict(fit_multiv_int1, TRAIN)
plot(TRAIN$thickness, pred_multiv_int1, pch = "*", col="red")
pred_multiv_valid_int1 <- predict(fit_multiv_int1, VALID)
plot(VALID$thickness, pred_multiv_valid_int1, pch = "*", col="red")

fit_multiv_intA <- lm(thickness ~ status * sex * age * ulcer, data=TRAIN)
summary(fit_multiv_intA)
Call:
lm(formula = thickness ~ status * sex * age * ulcer, data = TRAIN)
Residuals:
Min 1Q Median 3Q Max
-4.7433 -1.3598 -0.5319 0.5834 11.3838
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.60745 6.17420 1.394 0.16551
status -4.07618 3.24814 -1.255 0.21161
sex -15.50559 10.48788 -1.478 0.14156
age -0.10457 0.10424 -1.003 0.31753
ulcer -16.11057 8.04644 -2.002 0.04721 *
status:sex 8.13138 5.52054 1.473 0.14303
status:age 0.06471 0.05469 1.183 0.23874
sex:age 0.33742 0.18174 1.857 0.06549 .
status:ulcer 9.24986 4.40293 2.101 0.03746 *
sex:ulcer 26.53865 12.38508 2.143 0.03387 *
age:ulcer 0.35401 0.13450 2.632 0.00945 **
status:sex:age -0.17161 0.09619 -1.784 0.07660 .
status:sex:ulcer -12.56049 6.82652 -1.840 0.06791 .
status:age:ulcer -0.18545 0.07308 -2.538 0.01226 *
sex:age:ulcer -0.54850 0.21219 -2.585 0.01077 *
status:sex:age:ulcer 0.27148 0.11612 2.338 0.02082 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.721 on 139 degrees of freedom
Multiple R-squared: 0.3032, Adjusted R-squared: 0.228
F-statistic: 4.032 on 15 and 139 DF, p-value: 4.724e-06
pred_multiv_intA <- predict(fit_multiv_intA, TRAIN)
plot(TRAIN$thickness, pred_multiv_intA, pch = "*", col="red")
pred_multiv_valid_intA <- predict(fit_multiv_intA, VALID)
plot(VALID$thickness, pred_multiv_valid_intA, pch = "*", col="red")
anova(fit_multiv, fit_multiv_int1)
Analysis of Variance Table
Model 1: thickness ~ status + sex + age + ulcer
Model 2: thickness ~ status + sex + age + ulcer + sex:ulcer
Res.Df RSS Df Sum of Sq F Pr(>F)
1 150 1136.3
2 149 1132.6 1 3.7214 0.4896 0.4852
anova(fit_multiv_int1, fit_multiv_intA)
Analysis of Variance Table
Model 1: thickness ~ status + sex + age + ulcer + sex:ulcer
Model 2: thickness ~ status * sex * age * ulcer
Res.Df RSS Df Sum of Sq F Pr(>F)
1 149 1132.6
2 139 1029.1 10 103.48 1.3977 0.1874
par(mfrow=c(1,1))
MSE_train_multiv <- mean(residuals(fit_multiv)^2)
MSE_valid_multiv <- mean((pred_multiv_test - VALID$thickness)^2)

MSE_train_multiv_int1 <- mean(residuals(fit_multiv_int1)^2)
MSE_valid_multiv_int1 <- mean((pred_multiv_valid_int1 - VALID$thickness)^2)

MSE_train_multiv_intA <- mean(residuals(fit_multiv_intA)^2)
MSE_valid_multiv_intA <- mean((pred_multiv_valid_intA - VALID$thickness)^2)

MSEs <- rbind( c(MSE_train_multiv, MSE_train_multiv_int1, MSE_train_multiv_intA), c(MSE_valid_multiv, MSE_valid_multiv_int1, MSE_valid_multiv_intA) )
barplot(MSEs, beside=TRUE, ylim=c(0,10), names.arg = c("TR_M","TS_M","TR_MI1","TS_MI1","TR_MIA","TS_MIA"), las=2, ylab="MSE")
BONUS: Lasso regression with interaction terms
The code below illustrates how to implement LASSO regression with interaction terms included as predictors, using the melanoma dataset as an example:
lambdas <- 10^seq(2, -5, by = -.1)
f <- as.formula(thickness ~ status * sex * age * ulcer + 0)

x <- model.matrix(f, TRAIN)
y <- as.matrix(TRAIN$thickness, ncol=1)

LASSO_regr <- cv.glmnet(x, y, alpha=1, nfolds=10, lambda=lambdas)

lambda_best <- LASSO_regr$lambda.min

g <- glmnet(x, y, lambda=lambda_best)
print(coef(g))
16 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 3.347973929
status -1.211868330
sex 0.310903562
age 0.017431402
ulcer 0.945514137
status:sex .
status:age .
sex:age .
status:ulcer .
sex:ulcer .
age:ulcer 0.008699109
status:sex:age .
status:sex:ulcer 0.469772793
status:age:ulcer .
sex:age:ulcer .
status:sex:age:ulcer .
Footnotes
[1] However, consider what happens to the value of adj.r.squared when a model nearly perfectly interpolates the data. Can it still be trusted as a metric for these tasks?
[2] For more information on cross-validation check the Wikipedia article: cross validation.
[3] There exist some notable exceptions where extremely over-parameterized (\(p>n\)) models combined with specific model-fitting procedures exhibit the counter-intuitive behavior of decreasing variance as the number of parameters increases. For those interested in this fascinating phenomenon: Bias-variance tradeoff: A modern perspective; Benign overfitting.