In this lecture, we will take our first steps in statistical modeling using linear regression. We will see how such models can be fitted to data, how to use continuous and categorical inputs, and how to use a fitted model for prediction.
Load the tidyverse package
We start by loading the packages needed for today’s session.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
Load data
For this session, getting the data is trivial, as we will use a built-in dataset called iris. You can use ?datasets::iris to learn more about it.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Here is an illustration of the variables contained in iris
(image source: https://www.datacamp.com/community/tutorials/machine-learning-in-r):
Association between two variables
We have seen before that the association between two variables can be measured for example by calculating their correlation coefficient:
# global correlation over all three species
cor(iris$Sepal.Length, iris$Petal.Length)
[1] 0.8717538
# correlation by species
iris |>
  group_by(Species) |>
  summarize(R = cor(Sepal.Length, Petal.Length), .groups = "drop")
# A tibble: 3 × 2
Species R
<fct> <dbl>
1 setosa 0.267
2 versicolor 0.754
3 virginica 0.864
generally, a model formalizes the relationship between variables: \(y = f(x_i)\)
Notation: \(y\) and \(x_i\) (\(x_1\), \(x_2\), …) are random variables
by defining \(f(\ldots)\) we formalize an assumption about the type of relationship
linear models model or predict a variable (\(y\)) as a linear combination of one or several explanatory variables (\(x_i\)): \(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_nx_n\)
\(y\) is the response variable or dependent variable (it depends on the \(x_i\))
\(x_i\) are the independent variables or predictors (they allow prediction of \(y\))
\(\beta_i\) are the parameters to fit (coefficients).
In the simple case of \(y = \beta_0 + \beta_1x_1\), \(\beta_0\) is the intercept and \(\beta_1\) the slope.
What is linear regression: Residuals
The relationship between \(y\) and the \(x_i\) can be formally written as: \[
y = \beta_0 + \sum_{i}{\beta_i x_i} + \epsilon
\]
or, combining the \(\beta_i\) values into a vector \(\beta\) and the \(x_i\) as columns of a matrix \(X\): \[
y = \beta_0 + \beta \cdot X + \epsilon
\]
\(\epsilon\) contains the errors or residuals, often assumed to follow a normal distribution with mean zero: \(\epsilon \sim \mathcal{N}(0,\sigma^2)\)
\(\epsilon\) simply is the difference between the observed values \(y\) and the corresponding predicted values \(\hat{y} = \beta_0 + \sum_{i}{\beta_i x_i}\)
\[
\epsilon = y - \hat{y}
\]
\(\epsilon\) is assumed to contain the variance of \(y\) that would be explained by omitted influences (predictors we forgot to include in the model, e.g. because we don’t know about their influence on \(y\)). In addition, any measurement error in \(y\) will also be absorbed by \(\epsilon\).
Measurement errors in the predictors \(x_i\) are assumed to be absent and are not modeled by standard linear regression.
Fitting a linear model
“Fitting a linear model” means finding values for \(\beta_i\) that minimize the residuals, for example the sum of their squares (ordinary least squares, \(j\) is the index of individual elements of the vectors \(\epsilon\), \(y\) and \(\hat{y}\)): \[
\min_{\beta_i} \sum_{j}\epsilon_j^2 = \sum_{j}(y_j - \hat{y_j})^2
\]
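To make this concrete, here is a minimal sketch (the names Xmat and yvec are chosen just for illustration) that computes the ordinary least squares solution for a single predictor directly from the normal equations; lm(), introduced below, solves the same problem via a numerically more stable QR decomposition.
# a sketch of the analytical OLS solution: solve t(X) X beta = t(X) y
Xmat <- cbind(1, iris$Sepal.Length)  # design matrix: a column of ones (intercept) plus the predictor
yvec <- iris$Petal.Length
solve(t(Xmat) %*% Xmat, t(Xmat) %*% yvec)
# the two values match coefficients(lm(Petal.Length ~ Sepal.Length, data = iris))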
Here is an illustration of the residuals for ten randomly selected observations in iris:
In the plot above, why are the illustrated residuals parallel to the y-axis, and not for example orthogonal to the regression line? What does this imply about what the residuals correspond to?
# Residuals are measured in 'y-units' (thus when visualized they are
# parallel to the y-axis). This for example follows from the definition
# of the residuals: epsilon = y - y_hat
#
# The residuals thus capture unexplained variance (e.g. prediction
# errors that result from missing predictors or errors in the model
# formula), as well as measurement errors of y. Importantly, the
# residuals do not capture measurement errors in x_i.
Why use linear regression?
Linear regression has many practical applications that often fall in one of two categories:
Predictive modeling (machine learning)
Predictive modeling aims to train a model on observed data,
for use in prediction of unobserved \(y\) using new values of \(x_i\).
This is for example useful when \(y\) is difficult to observe, but it’s simple to measure \(x_i\), and \(y\) can be well predicted using \(x_i\).
Studying association of \(x_i\) to \(y\) (statistical modeling)
Given \(y\) and several \(x_1, x_2, \ldots, x_n\) that may be related to \(y\), (linear) models can be applied to quantify the strength of the association between \(x_i\) and \(y\).
An \(x_i\) that is strongly associated with \(y\) is a strong predictor for \(y\):
the predictive performance of a model that includes \(x_i\) will be significantly better compared to the model without \(x_i\) as predictor.
If a set of \(x_i\) predicts \(y\) well, it is unlikely that there are many important influences on \(y\) that we are not aware of.
More details about linear regression are available from
http://en.wikipedia.org/wiki/Linear_regression
Creating a linear model in R
In R, linear models are created using lm()
For a model \(y = \beta_0 + \beta_1x_1 + \beta_2x_2\), the formula is of the form: y ~ x1 + x2
multiple regressors are combined using +
the intercept (constant term) \(\beta_0\) is implicitly included, unless you say y ~ x1 + x2 + 0
Let’s look at an example:
fit <- lm(Petal.Length ~ Sepal.Length, data = iris)
fit
coefficients(fit) gives you back the \(\beta_i\) (e.g. the intercept and slope terms in the case of a single predictor).
residuals(fit) gives you back the \(\epsilon\) values. You could for example calculate the residual sum of squares using sum(residuals(fit)^2) (111.4591551).
predict(fit) gives you back the fitted values for the response (the \(\hat{y}\)). You can also use this function to predict the response for new observations \(x_i\), by passing a data.frame of new predictor values via predict(fit, newdata = ...).
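As a short sketch, here are these accessors applied to the fit above (the new Sepal.Length values in the last call are arbitrary example values):
coefficients(fit)                     # intercept and slope
head(residuals(fit))                  # epsilon for the first observations
sum(residuals(fit)^2)                 # residual sum of squares
head(predict(fit))                    # fitted values y_hat
predict(fit, newdata = data.frame(Sepal.Length = c(5, 6, 7)))  # predictions for new data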
Understanding the result of a linear fit: summary(fit)
summary(fit) # detailed summary
Call:
lm(formula = Petal.Length ~ Sepal.Length, data = iris)
Residuals:
Min 1Q Median 3Q Max
-2.47747 -0.59072 -0.00668 0.60484 2.49512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8678 on 148 degrees of freedom
Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
summary(fit) is a great summary of the overall results of the regression. Let’s go through it in more detail:
Call reminds you of how you created the model
Residuals give you a summary of \(\epsilon\) (ideally Gaussian-like)
Coefficients are the fitted \(\beta_i\) (Estimate), together with measures of how important a given coefficient is for the model. In particular, Pr(>|t|) gives the probability of “H0: the coefficient is zero”
R-squared is the fraction of variance in \(y\) that is captured by your model. A value of 1.0 would mean that the \(\hat{y}\) are identical to \(y\). For a model with a single predictor, this is equal to the square of the Pearson’s correlation coefficient.
summary(fit)$r.squared
[1] 0.7599546
cor(iris$Sepal.Length, iris$Petal.Length)^2
[1] 0.7599546
Obtaining tidy summaries from a fitted model
We have seen that information about a fitted model is stored in an object of class lm and can be obtained using accessor functions like coefficients(), residuals() or summary(). These typically do not return a data.frame and thus may be inconvenient for example in plotting with ggplot2.
The broom package implements accessors that always return a tibble:
tidy() summarizes information about model components
glance() reports information about the entire model
augment() adds information about individual observations to a dataset
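A quick sketch of these three accessors applied to the fit from above:
tidy(fit)           # one row per model term: estimate, std.error, statistic, p.value
glance(fit)         # one row per model: r.squared, adj.r.squared, sigma, ...
head(augment(fit))  # the data plus per-observation columns such as .fitted and .resid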
The lm object has its own plot method in base R, which produces plots to assess the quality of a linear model (for example to check if the model assumptions are fulfilled), see also ?plot.lm:
par(mfrow = c(2, 2))
plot(fit)
Most often you will however want to plot observed versus fitted values (first plot below), or add a trend-line to a plot of predictors versus response (second plot):
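A minimal sketch of the first kind of plot (observed versus fitted values, with the identity line added for reference):
# observed versus fitted values
ggplot(augment(fit), aes(.fitted, Petal.Length)) +
  geom_point() +
  geom_abline(linetype = "dashed") +   # identity line: observed == fitted
  labs(x = "Fitted Petal.Length", y = "Observed Petal.Length") +
  theme_light(16)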
# add a trendline
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_line(data = augment(fit), inherit.aes = FALSE,
            mapping = aes(Sepal.Length, .fitted),
            linetype = "dashed") +
  theme_light(16)
Note that in the second plot above, we have used augment(fit) to access the .fitted values from fit and plot them. ggplot2 also provides geom_smooth(..., method = "lm") which can be used to automatically fit the linear model and add the resulting regression line to the plot. If the fitted model object is not needed, this may be a more convenient way to visualize a linear fit.
Exercise 2
Where are the model predictions in the second plot above?
Re-create the second plot above, and add the points \((x_i, \hat{y}_i)\) with \(x_i \in \{5, 6, 7\}\) (new Sepal.Length values) and corresponding \(\hat{y}_i\) (predicted Petal.Length) to it.
# predict Petal.Length for new data using `fit`
nd <- data.frame(Sepal.Length = c(5, 6, 7)) |>
  mutate(Petal.Length = predict(fit, newdata = data.frame(Sepal.Length)))
nd
Try to add the trend-line to the above plot using geom_abline() instead of geom_line(), by extracting the intercept and slope directly from coefficients(fit) or tidy(fit).
# ... using coefficients(fit)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_abline(intercept = coefficients(fit)[1],
              slope = coefficients(fit)[2],
              linetype = "dashed") +
  theme_light(16)
# ... using tidy(fit)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_abline(intercept = tidy(fit)$estimate[1],
              slope = tidy(fit)$estimate[2],
              linetype = "dashed") +
  theme_light(16)
Create a linear model that predicts Petal.Length using Petal.Width as a predictor
fit2 <- lm(Petal.Length ~ Petal.Width, data = iris)
summary(fit2)
Call:
lm(formula = Petal.Length ~ Petal.Width, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.33542 -0.30347 -0.02955 0.25776 1.39453
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.08356 0.07297 14.85 <2e-16 ***
Petal.Width 2.22994 0.05140 43.39 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4782 on 148 degrees of freedom
Multiple R-squared: 0.9271, Adjusted R-squared: 0.9266
F-statistic: 1882 on 1 and 148 DF, p-value: < 2.2e-16
What is the model’s performance (is it better or worse than the one using Sepal.Length)?
This section is about the implementation of linear models in R, and is not strictly required to understand and use lm() productively. It may however make certain aspects more intuitive, like the concept of a design matrix (which has important applications, for example when analyzing transcriptomics experiments with the edgeR or DESeq2 packages) and what dummy variables are (discussed further below).
Reminder: When we fit a linear model of the form: \[
y = \beta_0 + \beta \cdot X + \epsilon
\]
we try to find values for \(\beta_0\) and \(\beta\) that will minimize the \(\epsilon\). As mentioned, \(\beta\) are the coefficients and \(X\) is a matrix whose columns are the predictors. In the analysis of experimental groups, it is used to encode the experimental design, by using predictors that, for example, assign the observations in \(y\) to experimental groups like control and treated. In this context, \(X\) is called the design matrix of the model.
Above we used lm(Petal.Length ~ Sepal.Length, data = iris) to create a linear model, providing a formula and the data. \(X\) is constructed internally by lm() using model.matrix():
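A minimal sketch to inspect this matrix yourself (same predictor as in the formula above, without the response):
# the first rows of the design matrix built internally by lm()
head(model.matrix(~ Sepal.Length, data = iris))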
Each column of \(X\) corresponds to a coefficient in the model, and an observation \(i\) (for example Petal.Length[1] when i = 1) is represented in the model by the sum of the elements in the \(i\)-th row of \(X\), multiplied by their \(\beta\) (for example \(1 * \beta_0 + 5.1 * \beta_{Sepal.Length}\)).
With \(X\) in hand, it becomes very simple to calculate the model predictions using linear algebra (dot product of coefficients vector \(\beta\) with design matrix \(X\)): \[
\hat y = \beta \cdot X
\]
Here, \(\beta\) includes the constant (intercept) term as its first element, and \(X\) contains a first column called (Intercept) with only values of 1.
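As a sketch, the fitted values of the simple model from above can be reproduced by hand this way (fit is the lm(Petal.Length ~ Sepal.Length, data = iris) object from earlier):
# compute y_hat manually from the design matrix and the fitted coefficients
X <- model.matrix(~ Sepal.Length, data = iris)  # includes the (Intercept) column of ones
beta <- coefficients(fit)
head(drop(X %*% beta))                          # identical to head(predict(fit))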
Categorical predictors
Regressors are usually numerical vectors, but they can also be categorical vectors (factors). This is particularly useful to include a predictor that classifies the observations into multiple groups, such as control and treated.
lm() uses model.matrix() (see above) to automatically transform a factor predictor f into nlevels(f) - 1 so-called dummy variables.
A coefficient is fit for each dummy variable (for each level of the factor, except for the first):
The first factor level is ignored and absorbed into the intercept term (predictors must not be collinear).
For each additional level of f, the dummy variable is a numerical vector with values of one (for the observations that belong to that level) or zero (for the remaining observations).
model.matrix() is the function that interprets the formula argument of lm(), if necessary transforms categorical variables into a set of dummy variables, and creates the matrix \(X\) in the linear model: \[
y = \beta_0 + \beta \cdot X + \epsilon
\]
For example, the following factor:
f <- factor(c("control", "control", "drug A", "drug A", "drug B", "drug B"))
f
[1] control control drug A drug A drug B drug B
Levels: control drug A drug B
nlevels(f)
[1] 3
is automatically transformed into the following by lm() (which internally calls model.matrix()):
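The resulting encoding can be reproduced directly (a minimal sketch using the factor f defined above):
model.matrix(~ f)   # columns: (Intercept), fdrug A, fdrug B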
You can see that the first level of f ("control") has disappeared, while the remaining ones have been converted to dummy variables with only zeros or ones. This is required to be able to calculate with categories.
In lm(Petal.Length ~ Sepal.Length + Species, data = iris), model.matrix() creates:
In the model matrix \(X\), numerical predictors are thus represented by a single column containing the predictor values themselves, while a categorical predictor with n levels is represented by n-1 columns containing 0 or 1 values.
Categorical predictors are widely used for example to encode experimental designs, such as in differential gene expression analysis based on (generalized) linear models (edgeR and DESeq2 packages):
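As a sketch of such a design (the group sizes below are made up for illustration), a two-group experiment could be encoded as follows:
# a hypothetical two-group design: 3 control and 3 treated samples
group <- factor(rep(c("control", "treatment"), each = 3))
model.matrix(~ group)   # columns: (Intercept) and grouptreatment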
In this experiment, two coefficients will be fit ((Intercept) and grouptreatment), one for each column of the design matrix. A control sample is represented as 1 * (Intercept) + 0 * grouptreatment, and a treated sample as 1 * (Intercept) + 1 * grouptreatment. In other words, (Intercept) represents the average of the control samples, and grouptreatment the difference between the averages of the treated and the control samples.
Multiple linear regression
Multiple predictors can be used in the model formula by combining them with +.
Let’s add Species to the model from before:
# model matrix ...
model.matrix(~ Sepal.Length + Species, data = iris)[c(1:2, 51:52, 101:102), ]
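The summary below corresponds to this extended model; a sketch of the call producing it (assuming the fit is again stored in fit):
fit <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)
summary(fit)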
Call:
lm(formula = Petal.Length ~ Sepal.Length + Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-0.76390 -0.17875 0.00716 0.17461 0.79954
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.70234 0.23013 -7.397 1.01e-11 ***
Sepal.Length 0.63211 0.04527 13.962 < 2e-16 ***
Speciesversicolor 2.21014 0.07047 31.362 < 2e-16 ***
Speciesvirginica 3.09000 0.09123 33.870 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2826 on 146 degrees of freedom
Multiple R-squared: 0.9749, Adjusted R-squared: 0.9744
F-statistic: 1890 on 3 and 146 DF, p-value: < 2.2e-16
Exercise 3
Did adding Species affect slope, intercept or both? Hint: visualize the fitted values from above in a Sepal.Length versus Petal.Length plot, or think about how the model matrix \(X\) and the coefficients represent the observations.
fit <- lm(Petal.Length ~ Sepal.Length + Species, data = iris)
fit
# adding `Species` to the model allows for a species-specific intercept
# (for example `(Intercept)` + `Speciesvirginica` for virginica flowers),
# but uses the same slope `Sepal.Length` for all three species.
Try to add the trend-lines for each species to the above plot using geom_abline(), by extracting the intercept and slope directly from coefficients(fit).
Simulate a set of three random variables \(a\), \(b\) and \(z\) that represent gene expression levels. Assume that the three genes are part of a regulatory chain, according to: \(a \to b \to z\). Hint: You can simulate a dependent variable by adding noise to the variable that it depends on:
set.seed(0)
a <- rnorm(100, 0, 1)
b <- a + rnorm(100, 0, 1)
z <- b + rnorm(100, 0, 3)
Now create linear models that predict \(z\) using \(a\) and \(b\) individually as well as in combination. Compare the coefficients for \(a\) and \(b\) and explain.
lm(z ~ a)
Call:
lm(formula = z ~ a)
Coefficients:
(Intercept) a
0.1477 1.3714
lm(z ~ b)
Call:
lm(formula = z ~ b)
Coefficients:
(Intercept) b
0.2035 1.0796
lm(z ~ a + b)
Call:
lm(formula = z ~ a + b)
Coefficients:
(Intercept) a b
0.1936 0.2983 0.9423
Interaction terms
the example above showed that adding Species allowed us to have a species-specific intercept, which improved our model fit (r.squared from 0.76 to 0.975)
even better would be a species-specific slope (the Sepal.Length coefficient), as the slope above is clearly not optimal for the setosa (red) species
this can be achieved by including an interaction term Sepal.Length:Species, or simply by combining the terms in the model formula using * instead of +, resulting in enumeration of all possible interaction terms:
fit <- lm(Petal.Length ~ Sepal.Length * Species, data = iris)
# same as:
# fit <- lm(Petal.Length ~ Sepal.Length + Species + Sepal.Length:Species, data = iris)
fit
This model with interaction terms has many more coefficients than the simpler models before. The Sepal.Length coefficient, which corresponded to the common slope before, now corresponds to the slope of the first species (first level in Species: setosa), and the interaction terms (e.g. Sepal.Length:Speciesversicolor) are the “slope correction” for observations in another species (e.g. versicolor). Even for a relatively simple model, it can become difficult to know exactly how fitted coefficients are combined to predict a given observation. Charlotte Soneson has created a Bioconductor package called ExploreModelMatrix (available from http://bioconductor.org/packages/ExploreModelMatrix/), which provides a shiny application for exploration of the model matrix and resulting coefficients. This is unfortunately beyond the scope of today’s lecture, but we might re-visit it in our advanced lecture on model-based differential gene expression.
Here is a screenshot of ExploreModelMatrix in action:
Exercise 4
Try to add the trend-lines for each species to the above plot (Sepal.Length versus Petal.Length, modeled as Petal.Length ~ Sepal.Length * Species) using geom_abline(), by extracting the intercepts and slopes directly from coefficients(fit).
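One possible approach (a sketch, not necessarily the intended solution; it assumes fit is the interaction model fitted above and uses the coefficient names reported by coefficients(fit)):
cf <- coefficients(fit)
# species-specific intercepts and slopes: setosa uses the base coefficients,
# the other two species add their offset and interaction terms
abl <- data.frame(
  Species   = factor(levels(iris$Species), levels = levels(iris$Species)),
  intercept = c(cf["(Intercept)"],
                cf["(Intercept)"] + cf["Speciesversicolor"],
                cf["(Intercept)"] + cf["Speciesvirginica"]),
  slope     = c(cf["Sepal.Length"],
                cf["Sepal.Length"] + cf["Sepal.Length:Speciesversicolor"],
                cf["Sepal.Length"] + cf["Sepal.Length:Speciesvirginica"])
)
ggplot(iris, aes(Sepal.Length, Petal.Length, color = Species)) +
  geom_point() +
  geom_abline(data = abl,
              mapping = aes(intercept = intercept, slope = slope, color = Species),
              linetype = "dashed") +
  theme_light(16)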
The simple form of linear models should not be a reason to underestimate their power:
under certain assumptions, a linear model can be fit exactly and efficiently, i.e. the global optimum for \(\min_{\beta_i} \sum_{j}(y_j - \hat{y_j})^2\) can be identified analytically (no need for numerical optimization, no risk of ending up in a local minimum)
main assumptions: linear relationship, normally distributed residuals, no collinearity among predictors, no autocorrelation, and constant residual variance across predictor values (homoscedasticity)
\(x_i\) can be (non-linear) transformations of observed variables. This allows for example the fitting of polynomial or exponential associations, using \(y = \alpha + \beta_1 x + \beta_2 x^2 + \ldots + \beta_n x^n\) or \(y = \alpha + \beta \log(x)\) (see the sketch after this list).
The same observed variable can be included several times in the model, as long as each inclusion is transformed differently and there is no perfectly correlated pair of \(x_i\).
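A small sketch of the transformed-predictor idea from the list above, fitting a quadratic association to simulated data (the variable names and coefficients here are made up for illustration):
# simulate a quadratic relationship and fit it with a linear model
set.seed(1)
xx <- runif(50, 0, 3)
yy <- 1 + 2 * xx - 0.5 * xx^2 + rnorm(50, 0, 0.2)
lm(yy ~ xx + I(xx^2))   # I() protects the square inside the formula
# equivalently: lm(yy ~ poly(xx, 2, raw = TRUE))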
Application example: Exponential rate estimation using linear regression
Let’s take the example of exponential decay (e.g. cancer cells dying at a constant rate upon exposure to a toxic drug). The goal is to estimate the half-life of the cells from the observed cell counts at different time points.
The cell number at time \(t\), assuming a half-life \(\tau_{1/2}\), is: \[
N(t) = N_0 \cdot 2^{-t/\tau_{1/2}}
\]
For example, after one half-life (\(t_1 = \tau_{1/2}\)), the number of remaining living cells is: \[
N(t_1 = \tau_{1/2}) = N_0 \cdot 2^{-\tau_{1/2}/\tau_{1/2}} = N_0 \cdot 2^{-1} = N_0 / 2
\]
# simulate exponential data with noise
set.seed(8)
x <- rep(1:9, each = 3) # three replicate measurements per timepoint
y <- 1e6 * 2^(-x / 1.5) * rnorm(length(x), 1, 0.2) + rnorm(length(x), 0, 1e4)
decay <- data.frame(time = x, cell_count = y)

# visualize
ggplot(decay, aes(time, cell_count)) +
  geom_point(shape = 1) +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6)) +
  labs(x = "Time (h)", y = "Cell count (Mio.)") +
  annotate("text", x = Inf, y = Inf, hjust = 1.1, vjust = 1.1, size = 5,
           label = "simulated data\n(half-life = 1.5h)") +
  theme_light(16)
Making an exponential process linear by log-transformation
If we transform everything into \(\log_2\) space, we get: \[
\log_2 N(t) = \log_2[N_0 \cdot 2^{-t/\tau_{1/2}}] = (-1/\tau_{1/2}) \cdot t + (\log_2 N_0)
\]
In other words, the log-transformation yields us a linear equation of the form \[
y = \beta_1x + \beta_0
\]
where \(\beta_1\) (the slope) is equal to \(-1/\tau_{1/2}\), so we can calculate the half-life using: \[
\tau_{1/2} = -1 / \beta_1
\]
Exercise 5
Use this information to estimate the half-life from \(x\) and \(y\) by fitting a linear model.
# fit
fit <- lm(log2(cell_count) ~ time, data = decay)
t12 <- -1 / coefficients(fit)[2]
t12