In this lecture, we will take our first steps in statistical modeling using linear regression. We will see how such models can be fitted to data, how to use continuous and categorical inputs, and how to use a fitted model for prediction.
Load the tidyverse package
We start by loading the packages needed for today’s session.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
Load data
For this session, getting the data is trivial, as we will use a built-in dataset called iris. You can use ?datasets::iris to learn more about it.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Here is an illustration of the variables contained in iris
(image source: https://www.datacamp.com/community/tutorials/machine-learning-in-r):
Association between two variables
We have seen before that the association between two variables can be measured for example by calculating their correlation coefficient:
# global correlation over all three species
cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
# correlation by species
iris |>
  group_by(Species) |>
  summarize(R = cor(Sepal.Length, Petal.Length), .groups = "drop")
# A tibble: 3 × 2
Species R
<fct> <dbl>
1 setosa 0.267
2 versicolor 0.754
3 virginica 0.864
generally, a model formalizes the relationship between variables: \(y = f(x_i)\)
Notation: \(y\) and \(x_i\) (\(x_1\), \(x_2\), …) are random variables
by defining \(f(\ldots)\) we formalize an assumption about the type of relationship
linear models model or predict a variable (\(y\)) as a linear combination of one or several explanatory variables (\(x_i\)): \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n\)
\(y\) is the response variable or dependent variable (it depends on the \(x_i\))
\(x_i\) are the independent variables or predictors (they allow prediction of \(y\))
\(\beta_i\) are the parameters to fit (coefficients).
In the simple case of \(y = \beta_0 + \beta_1x_1\), \(\beta_0\) is the intercept and \(\beta_1\) the slope.
What is linear regression: Residuals
The relationship between \(y\) and the \(x_i\) can be formally written as: \[
y = \beta_0 + \sum_{i}{\beta_i x_i} + \epsilon
\]
or, combining the \(\beta_i\) values into a vector \(\beta\) and the \(x_i\) as columns of a matrix \(X\): \[
y = \beta_0 + \beta \cdot X + \epsilon
\]
\(\epsilon\) contains the errors or residuals, often assumed to follow a normal distribution with mean zero: \(\epsilon \sim \mathcal{N}(0,\sigma^2)\)
\(\epsilon\) simply is the difference between the observed values \(y\) and the corresponding predicted values \(\hat{y} = \beta_0 + \sum_{i}{\beta_i x_i}\)
\[
\epsilon = y - \hat{y}
\]
\(\epsilon\) is assumed to contain the variance of \(y\) that would be explained by omitted influences (predictors we forgot to include in the model, e.g. because we don’t know about their influence on \(y\)). In addition, any measurement error in \(y\) will also be absorbed by \(\epsilon\).
Measurement errors in the predictors \(x_i\) are assumed to be absent and are not modeled by standard linear regression.
Fitting a linear model
“Fitting a linear model” means finding values for \(\beta_i\) that minimize the residuals, for example the sum of their squares (ordinary least squares, \(j\) is the index of individual elements of the vectors \(\epsilon\), \(y\) and \(\hat{y}\)): \[
\min_{\beta_i} \sum_{j}\epsilon_j^2 = \sum_{j}(y_j - \hat{y_j})^2
\]
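To make this concrete, here is a minimal sketch (the names Xmat and yvec are chosen just for illustration) that computes the ordinary least squares solution for a single predictor directly from the normal equations; lm(), introduced below, solves the same problem via a numerically more stable QR decomposition.
# a sketch of the analytical OLS solution: solve t(X) X beta = t(X) y
Xmat <- cbind(1, iris$Sepal.Length)  # design matrix: a column of ones (intercept) plus the predictor
yvec <- iris$Petal.Length
solve(t(Xmat) %*% Xmat, t(Xmat) %*% yvec)
# the two values match coefficients(lm(Petal.Length ~ Sepal.Length, data = iris))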
Here is an illustration of the residuals for ten randomly selected observations in iris:
In the plot above, why are the illustrated residuals parallel to the y-axis, and not for example orthogonal to the regression line? What does this imply about what the residuals correspond to?
# Residuals are measured in 'y-units' (thus when visualized they are
# parallel to the y-axis). This for example follows from the definition
# of the residuals: epsilon = y - y_hat
#
# The residuals thus capture unexplained variance (e.g. prediction
# errors that result from missing predictors or errors in the model
# formula), as well as measurement errors of y. Importantly, the
# residuals do not capture measurement errors in x_i.
Why use linear regression?
Linear regression has many practical applications that often fall in one of two categories:
Predictive modeling (machine learning)
Predictive modeling aims to train a model on observed data,
for use in prediction of unobserved \(y\) using new values of \(x_i\).
This is for example useful when \(y\) is difficult to observe, but it’s simple to measure \(x_i\), and \(y\) can be well predicted using \(x_i\).
Studying association of \(x_i\) to \(y\) (statistical modeling)
Given \(y\) and several \(x_1, x_2, \ldots, x_n\) that may be related to \(y\), (linear) models can be applied to quantify the strength of the association between \(x_i\) and \(y\).
An \(x_i\) that is strongly associated with \(y\) is a strong predictor for \(y\):
the predictive performance of a model that includes \(x_i\) will be significantly better compared to the model without \(x_i\) as predictor.
If a set of \(x_i\) predicts \(y\) well, it is unlikely that there are many important influences on \(y\) that we are not aware of.
More details about linear regression are available from
http://en.wikipedia.org/wiki/Linear_regression
Creating a linear model in R
In R, linear models are created using lm()
For a model \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\), the formula is of the form: y ~ x1 + x2
multiple regressors are combined using +
the intercept (constant term) \(\beta_0\) is implicitly included, unless you say y ~ x1 + x2 + 0
Let’s look at an example:
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
fit
coefficients(fit) gives you back the \(\beta_i\) (e.g. the intercept and slope terms in the case of a single predictor).
residuals(fit) gives you back the \(\epsilon\) values. You could for example calculate the residual sum of squares using sum(residuals(fit)^2) (111.4591551).
predict(fit) gives you back the fitted values for the response (the \(\hat{y}\)). You can also use this function to predict the response for new observations \(x_i\), by passing a data.frame of new predictor values via predict(fit, newdata = ...).
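As a short sketch, here are these accessors applied to the fit above (the new Sepal.Length values in the last call are arbitrary example values):
coefficients(fit)                     # intercept and slope
head(residuals(fit))                  # epsilon for the first observations
sum(residuals(fit)^2)                 # residual sum of squares
head(predict(fit))                    # fitted values y_hat
predict(fit, newdata = data.frame(Sepal.Length = c(5, 6, 7)))  # predictions for new data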
Understanding the result of a linear fit: summary(fit)
summary(fit) # detailed summary
Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)
Residuals:
Min 1Q Median 3Q Max
-2.47747 -0.59072 -0.00668 0.60484 2.49512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8678 on 148 degrees of freedom
Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
summary(fit) is a great summary of the overall results of the regression. Let’s go through it in more detail:
Call reminds you of how you created the model
Residuals give you a summary of \(\epsilon\) (ideally Gaussian-like)
Coefficients are the fitted \(\beta_i\) (Estimate), together with measures of how important a given coefficient is for the model. In particular, Pr(>|t|) gives the probability of “H0: the coefficient is zero”
R-squared is the fraction of variance in \(y\) that is captured by your model. A value of 1.0 would mean that the \(\hat{y}\) are identical to \(y\). For a model with a single predictor, this is equal to the square of the Pearson’s correlation coefficient.
summary(fit)$r.squared
[1] 0.7599546
cor(iris$Sepal.Length, iris$Petal.Length)^2
[1] 0.7599546
Obtaining tidy summaries from a fitted model
We have seen that information about a fitted model is stored in an object of class lm and can be obtained using accessor functions like coefficients(), residuals() or summary(). These typically do not return a data.frame and thus may be inconvenient for example in plotting with ggplot2.
The broom package implements accessors that always return a tibble:
tidy() summarizes information about model components
glance() reports information about the entire model
augment() adds information about individual observations to a dataset
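A quick sketch of these three accessors applied to the fit from above:
tidy(fit)           # one row per model term: estimate, std.error, statistic, p.value
glance(fit)         # one row per model: r.squared, adj.r.squared, sigma, ...
head(augment(fit))  # the data plus per-observation columns such as .fitted and .resid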
The lm object has its own plot method in base R, which produces plots to assess the quality of a linear model (for example to check if the model assumptions are fulfilled), see also ?plot.lm:
par(mfrow = c(2, 2))
plot(fit)
Most often you will however want to plot observed versus fitted values (first plot below), or add a trend-line to a plot of predictors versus response (second plot):
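A minimal sketch of the first kind of plot (observed versus fitted values, with the identity line added for reference):
# observed versus fitted values
ggplot(augment(fit), aes(.fitted, Petal.Length)) +
  geom_point() +
  geom_abline(linetype = "dashed") +   # identity line: observed == fitted
  labs(x = "Fitted Petal.Length", y = "Observed Petal.Length") +
  theme_light(16)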
# add a trendline
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_line(data = augment(fit), inherit.aes = FALSE,
            mapping = aes(Sepal.Length, .fitted),
            linetype = "dashed") +
  theme_light(16)
Note that in the second plot above, we have used augment(fit) to access the .fitted values from fit and plot them. ggplot2 also provides geom_smooth(..., method = "lm") which can be used to automatically fit the linear model and add the resulting regression line to the plot. If the fitted model object is not needed, this may be a more convenient way to visualize a linear fit.
Exercise 2
Where are the model predictions in the second plot above?
Re-create the second plot above, and add the points \((x_i, \hat{y}_i)\) with \(x_i \in \{5, 6, 7\}\) (new Sepal.Length values) and corresponding \(\hat{y}_i\) (predicted Petal.Length) to it.
# predict Petal.Length for new data using `fit`
nd <- data.frame(Sepal.Length = c(5, 6, 7)) |>
  mutate(Petal.Length = predict(fit, newdata = data.frame(Sepal.Length)))
nd
Try to add the trend-line to the above plot using geom_abline() instead of geom_line(), by extracting the intercept and slope directly from coefficients(fit) or tidy(fit).
# ... using coefficients(fit)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_abline(intercept = coefficients(fit)[1],
              slope = coefficients(fit)[2],
              linetype = "dashed") +
  theme_light(16)
# ... using tidy(fit)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_abline(intercept = tidy(fit)$estimate[1],
              slope = tidy(fit)$estimate[2],
              linetype = "dashed") +
  theme_light(16)
Create a linear model that predicts Petal.Length using Petal.Width as a predictor
fit2 <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(fit2)
Call:
lm(formula = Petal.Length ~ Petal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.33542 -0.30347 -0.02955 0.25776 1.39453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.08356 0.07297 14.85 <2e-16 ***
Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4782 on 148 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
What is the model’s performance (is it better or worse than the one using Sepal.Length)?
This section is about the implementation of linear models in R, and is not strictly required to understand and use lm() productively. It may however make certain aspects more intuitive, like the concept of a design matrix (which has important applications, for example when analyzing transcriptomics experiments with the edgeR or DESeq2 packages) and what dummy variables are (discussed further below).
Reminder: When we fit a linear model of the form: \[
y = \beta_0 + \beta \cdot X + \epsilon
\]
we try to find values for \(\beta_0\) and \(\beta\) that will minimize the \(\epsilon\). As mentioned, \(\beta\) are the coefficients and \(X\) is a matrix whose columns are the predictors. In the analysis of experimental groups, it is used to encode the experimental design, by using predictors that, for example, assign the observations in \(y\) to experimental groups like control and treated. In this context, \(X\) is called the design matrix of the model.
Above we used lm(Petal.Length ~ Sepal.Length, data = iris) to create a linear model, providing a formula and the data. \(X\) is constructed internally by lm() using model.matrix():
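A minimal sketch to inspect this matrix yourself (same predictor as in the formula above, without the response):
# the first rows of the design matrix built internally by lm()
head(model.matrix(~ Sepal.Length, data = iris))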
Each column of \(X\) corresponds to a coefficient in the model, and an observation \(i\) (for example Petal.Length[1] when i = 1) is represented in the model by the sum of the elements in the \(i\)-th row of \(X\), multiplied by their \(\beta\) (for example \(1 * \beta_0 + 5.1 * \beta_{Sepal.Length}\)).
With \(X\) in hand, it becomes very simple to calculate the model predictions using linear algebra (dot product of coefficients vector \(\beta\) with design matrix \(X\)): \[
\hat y = \beta \cdot X
\]
Here, \(\beta\) includes the constant (intercept) term as its first element, and \(X\) contains a first column called (Intercept) with only values of 1.
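As a sketch, the fitted values of the simple model from above can be reproduced by hand this way (fit is the lm(Petal.Length ~ Sepal.Length, data = iris) object from earlier):
# compute y_hat manually from the design matrix and the fitted coefficients
X <- model.matrix(~ Sepal.Length, data = iris)  # includes the (Intercept) column of ones
beta <- coefficients(fit)
head(drop(X %*% beta))                          # identical to head(predict(fit))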
Categorical predictors
Regressors are usually numerical vectors, but they can also be categorical vectors (factors). This is particularly useful to include a predictor that classifies the observations into multiple groups, such as control and treated.
lm() uses model.matrix() (see above) to automatically transform a factor predictor f into nlevels(f) - 1 so-called dummy variables.
A coefficient is fit for each dummy variable (for each level of the factor, except for the first):
The first factor level is ignored and absorbed into the intercept term (predictors must not be collinear).
For each additional level of f, the dummy variable is a numerical vector with values of one (for the observations that belong to that level) or zero (for the remaining observations).
model.matrix() is the function that interprets the formula argument of lm(), if necessary transforms categorical variables into a set of dummy variables, and creates the matrix \(X\) in the linear model: \[
y = \beta_0 + \beta \cdot X + \epsilon
\]
For example, the following factor:
f <- factor(c("control", "control", "drug A", "drug A", "drug B", "drug B"))
f
[1] control control drug A drug A drug B drug B
Levels: control drug A drug B
nlevels(f)
[1] 3
is automatically transformed into the following by lm() (which internally calls model.matrix()):
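The resulting encoding can be reproduced directly (a minimal sketch using the factor f defined above):
model.matrix(~ f)   # columns: (Intercept), fdrug A, fdrug B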
You can see that the first level of f ("control") has disappeared, while the remaining ones have been converted to dummy variables with only zeros or ones. This is required to be able to calculate with categories.
In lm(Petal.Length ~ Sepal.Length + Species, data = iris), model.matrix() creates:
In the model matrix \(X\), numerical predictors are thus represented by a single column containing the predictor values themselves, while a categorical predictor with n levels is represented by n-1 columns containing 0 or 1 values.
Categorical predictors are widely used for example to encode experimental designs, such as in differential gene expression analysis based on (generalized) linear models (edgeR and DESeq2 packages):
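As a sketch of such a design (the group sizes below are made up for illustration), a two-group experiment could be encoded as follows:
# a hypothetical two-group design: 3 control and 3 treated samples
group <- factor(rep(c("control", "treatment"), each = 3))
model.matrix(~ group)   # columns: (Intercept) and grouptreatment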
In this experiment, two coefficients will be fit ((Intercept) and grouptreatment), one for each column of the design matrix. A control sample is represented as 1 * (Intercept) + 0 * grouptreatment, and a treated sample as 1 * (Intercept) + 1 * grouptreatment. In other words, (Intercept) represents the average of the control samples, and grouptreatment the difference between the averages of the treated and the control samples.
Multiple linear regression
Multiple predictors can be used in the model formula by combining them with +.
Let’s add Species to the model from before:
# model matrix ...
model.matrix(~ Sepal.Length + Species, data = iris)[c(1:2, 51:52, 101:102), ]
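The summary below corresponds to this extended model; a sketch of the call producing it (assuming the fit is again stored in fit):
fit <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)
summary(fit)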
Call:
lm(formula = Petal.Length ~ Sepal.Length + Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.76390 -0.17875 0.00716 0.17461 0.79954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.70234 0.23013 -7.397 1.01e-11 ***
Sepal.Length 0.63211 0.04527 13.962 < 2e-16 ***
Speciesversicolor 2.21014 0.07047 31.362 < 2e-16 ***
Speciesvirginica 3.09000 0.09123 33.870 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2826 on 146 degrees of freedom
Multiple R-squared: 0.9749, Adjusted R-squared: 0.9744
F-statistic: 1890 on 3 and 146 DF, p-value: < 2.2e-16
Exercise 3
Did adding Species affect slope, intercept or both? Hint: visualize the fitted values from above in a Sepal.Length versus Petal.Length plot, or think about how the model matrix \(X\) and the coefficients represent the observations.
fit <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)
fit
# adding `Species` to the model allows for a species-specific intercept
# (for example `(Intercept)` + `Speciesvirginica` for virginica flowers),
# but uses the same slope `Sepal.Length` for all three species.
Try to add the trend-lines for each species to the above plot using geom_abline(), by extracting the intercept and slope directly from coefficients(fit).
Simulate a set of three random variables \(a\), \(b\) and \(z\) that represent gene expression levels. Assume that the three genes are part of a regulatory chain, according to: \(a \to b \to z\). Hint: You can simulate a dependent variable by adding noise to the variable that it depends on:
set.seed(0)
a <- rnorm(100, 0, 1)
b <- a + rnorm(100, 0, 1)
z <- b + rnorm(100, 0, 3)
Now create linear models that predict \(z\) using \(a\) and \(b\) individually as well as in combination. Compare the coefficients for \(a\) and \(b\) and explain.
lm(z ~ a)
Call:
lm(formula = z ~ a)
Coefficients:
(Intercept) a
0.1477 1.3714
lm(z ~ b)
Call:
lm(formula = z ~ b)
Coefficients:
(Intercept) b
0.2035 1.0796
lm(z ~ a + b)
Call:
lm(formula = z ~ a + b)
Coefficients:
(Intercept) a b
0.1936 0.2983 0.9423
Interaction terms
the example above showed that adding Species allowed us to have a species-specific intercept, which improved our model fit (r.squared from 0.76 to 0.975)
even better would be a species-specific slope (the Sepal.Length coefficient), as the slope above is clearly not optimal for the setosa (red) species
this can be achieved by including an interaction term Sepal.Length:Species, or simply by combining the terms in the model formula using * instead of +, resulting in enumeration of all possible interaction terms:
fit <- lm(Petal.Length ~ Sepal.Length * Species, data = iris)
# same as:
# fit <- lm(Petal.Length ~ Sepal.Length + Species + Sepal.Length:Species, data = iris)
fit
This model with interaction terms has many more coefficients than the simpler models before. The Sepal.Length coefficient, which corresponded to the common slope before, now corresponds to the slope of the first species (first level in Species: setosa), and the interaction terms (e.g. Sepal.Length:Speciesversicolor) are the “slope correction” for observations in another species (e.g. versicolor). Even for a relatively simple model, it can become difficult to know exactly how fitted coefficients are combined to predict a given observation. Charlotte Soneson has created a Bioconductor package called ExploreModelMatrix (available from http://bioconductor.org/packages/ExploreModelMatrix/), which provides a shiny application for exploration of the model matrix and resulting coefficients. This is unfortunately beyond the scope of today’s lecture, but we might re-visit it in our advanced lecture on model-based differential gene expression.
Here is a screenshot of ExploreModelMatrix in action:
Exercise 4
Try to add the trend-lines for each species to the above plot (Sepal.Length versus Petal.Length, modeled as Petal.Length ~ Sepal.Length * Species) using geom_abline(), by extracting the intercepts and slopes directly from coefficients(fit).
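One possible approach (a sketch, not necessarily the intended solution; it assumes fit is the interaction model fitted above and uses the coefficient names reported by coefficients(fit)):
cf <- coefficients(fit)
# species-specific intercepts and slopes: setosa uses the base coefficients,
# the other two species add their offset and interaction terms
abl <- data.frame(
  Species   = factor(levels(iris$Species), levels = levels(iris$Species)),
  intercept = c(cf["(Intercept)"],
                cf["(Intercept)"] + cf["Speciesversicolor"],
                cf["(Intercept)"] + cf["Speciesvirginica"]),
  slope     = c(cf["Sepal.Length"],
                cf["Sepal.Length"] + cf["Sepal.Length:Speciesversicolor"],
                cf["Sepal.Length"] + cf["Sepal.Length:Speciesvirginica"])
)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_abline(data = abl,
              mapping = aes(intercept = intercept, slope = slope, color = Species),
              linetype = "dashed") +
  theme_light(16)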
The simple form of linear models should not be a reason to underestimate their power:
under certain assumptions, a linear model can be fit exactly and efficiently, i.e. the global optimum for \(\min_{\beta_i} \sum_{j}(y_j - \hat{y_j})^2\) can be identified analytically (no need for numerical optimization, no risk of ending up in a local minimum)
main assumptions: linear relationship, normally distributed residuals, no collinearity among predictors, no autocorrelation, and constant residual variance across predictor values (homoscedasticity)
\(x_i\) can be (non-linear) transformations of observed variables. This allows for example the fitting of polynomial or exponential associations, using \(y = \alpha + \beta_1 x + \beta_2 x^2 + \ldots + \beta_n x^n\) or \(y = \alpha + \beta \log(x)\) (see the sketch after this list).
The same observed variable can be included several times in the model, as long as each inclusion is transformed differently and there is no perfectly correlated pair of \(x_i\).
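A small sketch of the transformed-predictor idea from the list above, fitting a quadratic association to simulated data (the variable names and coefficients here are made up for illustration):
# simulate a quadratic relationship and fit it with a linear model
set.seed(1)
xx <- runif(50, 0, 3)
yy <- 1 + 2 * xx - 0.5 * xx^2 + rnorm(50, 0, 0.2)
lm(yy ~ xx + I(xx^2))   # I() protects the square inside the formula
# equivalently: lm(yy ~ poly(xx, 2, raw = TRUE))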
Application example: Exponential rate estimation using linear regression
Let’s take the example of exponential decay (e.g. cancer cells dying at a constant rate upon exposure to a toxic drug). The goal is to estimate the half-life of the cells from the observed cell counts at different time points.
The cell number at time \(t\), assuming a half-life \(\tau_{1/2}\), is: \[
N(t) = N_0 \cdot 2^{-t/\tau_{1/2}}
\]
For example, after one half-life (\(t_1 = \tau_{1/2}\)), the number of remaining living cells is: \[
N(t_1 = \tau_{1/2}) = N_0 \cdot 2^{-\tau_{1/2}/\tau_{1/2}} = N_0 \cdot 2^{-1} = N_0 / 2
\]
# simulate exponential data with noise
set.seed(8)
x <- rep(1:9, each = 3) # three replicate measurements per timepoint
y <- 1e6 * 2^(-x / 1.5) * rnorm(length(x), 1, 0.2) + rnorm(length(x), 0, 1e4)
decay <- data.frame(time = x, cell_count = y)

# visualize
ggplot(decay, aes(time, cell_count)) +
  geom_point(shape = 1) +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6)) +
  labs(x = "Time (h)", y = "Cell count (Mio.)") +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.1, size = 5,
           label = "simulated data\n(half-life = 1.5h)") +
  theme_light(16)
Making an exponential process linear by log-transformation
If we transform everything into \(\log_2\) space, we get: \[
\log_2 N(t) = \log_2[N_0 \cdot 2^{-t/\tau_{1/2}}] = (-1/\tau_{1/2}) \cdot t + (\log_2 N_0)
\]
In other words, the log-transformation yields us a linear equation of the form \[
y = \beta_1x + \beta_0
\]
where \(\beta_1\) (the slope) is equal to \(-1/\tau_{1/2}\), so we can calculate the half-life using: \[
\tau_{1/2} = -1 / \beta_1
\]
Exercise 5
Use this information to estimate the half-life from \(x\) and \(y\) by fitting a linear model.
# fit
fit <- lm(log2(cell_count) ~ time, data = decay)
t12 <- -1 / coefficients(fit)[2]
t12