tidyverse: pipe operator, filter, select, mutate

Author

Charlotte Soneson

Published

October 2, 2024

Introduction

In the previous lecture we saw how to read data from a text file into R, where it would be represented in a tibble or data.frame. In this lecture, we will see how to work with the resulting object, and explore its content.

We will mostly use functions from the tidyverse, which provides a modern, widely used collection of R packages designed for data science. For some aspects, we will also illustrate how to perform the same operations using “base R” commands, that is, functions that are loaded with R by default rather than provided in a separate add-on package.

Exercise 0

Create a script (a regular *.R script, an *.Rmd R Markdown file, or a *.qmd Quarto file) that will be used to save all the code that we will write today.

Load the tidyverse package

In order to make the functionality of the tidyverse available in our R session, we first have to load the packages. Here, we will load the “meta-package” called tidyverse, which in turn will load a collection of packages providing different functionality for data analysis.

library(tidyverse)

── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load data

Next, we read the data that we will be using. Like last week, we will use the Palmer Penguins data set for our illustrations. We will assume that you have downloaded the file into a subfolder named data/ - if not, the following code from last week will download the file:

# first we need to create a `data` directory
dir.create("data")
# then we can download the file
download.file(url = "https://ivanek.github.io/introductionToR/data/penguins.csv",
              destfile = "data/penguins.csv")

You can check the content of the data/ subfolder of the current working directory with the command list.files("data") (there should be a penguins.csv in the list).

This data set contains records for 344 penguins from three different species, collected from 3 islands in the Palmer Archipelago, Antarctica.

dat <- readr::read_csv("data/penguins.csv")

Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(dat)

spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : num [1:344] 3750 3800 3250 NA 3450 ...
 $ sex              : chr [1:344] "male" "female" "female" NA ...
 $ year             : num [1:344] 2007 2007 2007 2007 2007 ...
 - attr(*, "spec")=
  .. cols(
  ..   species = col_character(),
  ..   island = col_character(),
  ..   bill_length_mm = col_double(),
  ..   bill_depth_mm = col_double(),
  ..   flipper_length_mm = col_double(),
  ..   body_mass_g = col_double(),
  ..   sex = col_character(),
  ..   year = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Subsetting

A common operation is to subset data sets, that is, to only retain some of the records. The specification of what to retain can be done in many different ways. For example, we may know which row numbers we want to keep, or we may only want to retain rows where one of the measured variables have a certain value (e.g., all records from 2007). As is usually the case with R, there are many ways of achieving the desired result, and we will consider a few different options.

To retain a pre-specified set of rows, we can use the slice function from dplyr:

# first row
slice(dat, 1)

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
# ℹ 2 more variables: sex <chr>, year <dbl>

# first, third and fifth row
slice(dat, c(1, 3, 5))

# A tibble: 3 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           40.3          18                 195        3250
3 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <dbl>

# all rows except the first 10
slice(dat, -(1:10))

# A tibble: 334 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           37.8          17.1               186
 2 Adelie  Torgersen           37.8          17.3               180
 3 Adelie  Torgersen           41.1          17.6               182
 4 Adelie  Torgersen           38.6          21.2               191
 5 Adelie  Torgersen           34.6          21.1               198
 6 Adelie  Torgersen           36.6          17.8               185
 7 Adelie  Torgersen           38.7          19                 195
 8 Adelie  Torgersen           42.5          20.7               197
 9 Adelie  Torgersen           34.4          18.4               184
10 Adelie  Torgersen           46            21.5               194
# ℹ 324 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

Similarly, the select function, also from dplyr, allows us to retain a given set of columns.

# first column
select(dat, 1)

# A tibble: 344 × 1
   species
   <chr>  
 1 Adelie 
 2 Adelie 
 3 Adelie 
 4 Adelie 
 5 Adelie 
 6 Adelie 
 7 Adelie 
 8 Adelie 
 9 Adelie 
10 Adelie 
# ℹ 334 more rows

# all columns except the first 3
select(dat, -(1:3))

# A tibble: 344 × 5
   bill_depth_mm flipper_length_mm body_mass_g sex     year
           <dbl>             <dbl>       <dbl> <chr>  <dbl>
 1          18.7               181        3750 male    2007
 2          17.4               186        3800 female  2007
 3          18                 195        3250 female  2007
 4          NA                  NA          NA <NA>    2007
 5          19.3               193        3450 female  2007
 6          20.6               190        3650 male    2007
 7          17.8               181        3625 female  2007
 8          19.6               195        4675 male    2007
 9          18.1               193        3475 <NA>    2007
10          20.2               190        4250 <NA>    2007
# ℹ 334 more rows

To subset by both row and column, we need to combine these operations. This can be done in several ways. The one that first may come to mind would be to first apply one of the operations (e.g., subset by row), store the intermediate result in a new variable, and then apply the column selection.

temp <- slice(dat, 1:5)
select(temp, 1:3)

# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

We could also write the expression above using a nested function call, without the intermediate variable, but this is already getting difficult to read!

select(slice(dat, 1:5), 1:3)

# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

A more intuitive way, which reduces the number of intermediate variables that are explicitly created and scales to more complex operations, is to use the pipe operator (|>).

Side note

Perhaps you have seen the pipe operator written as %>%. This pipe operator comes from the magrittr package, while the pipe operator we use here (|>) is provided with R itself. In most aspects, the two are equivalent.

The pipe operator can be read as “then”, and allows you to take the output from a command and send it directly into another command. The assumption (if not otherwise specified) is that the object that is being sent forward (e.g., dat) is the first argument of the function it is being sent into. This is the case for the slice and select functions that we used above - the first argument is a tibble or a data.frame (check ?slice). Moreover, the output of each of these functions is another object of the same type as the input (e.g. tibble or data.frame), which can then be sent further into the next function operating on this type of object. Let’s see how this works with a few examples. First, the following formulations are equivalent ways of selecting the first five rows of dat:

slice(dat, 1:5)

# A tibble: 5 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           39.5          17.4               186        3800
3 Adelie  Torge…           40.3          18                 195        3250
4 Adelie  Torge…           NA            NA                  NA          NA
5 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <dbl>

dat |> slice(1:5)

# A tibble: 5 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           39.5          17.4               186        3800
3 Adelie  Torge…           40.3          18                 195        3250
4 Adelie  Torge…           NA            NA                  NA          NA
5 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <dbl>

## We can be explicit about where the piped object (represented by `_`) goes
dat |> slice(.data = _, 1:5)

# A tibble: 5 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           39.5          17.4               186        3800
3 Adelie  Torge…           40.3          18                 195        3250
4 Adelie  Torge…           NA            NA                  NA          NA
5 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <dbl>

Next, let’s use the pipe to subset our tibble to the first five rows and the first three columns:

dat |>
  slice(1:5) |>
  select(1:3)

# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

This can be read as “take the dat object, then slice it to retain the first five rows, then select the first three columns”. While this may, perhaps, seem like an overly complicated way of subsetting a tibble, the ability to split up a procedure into smaller steps and chain them together in a logical and comprehensible way, can be very valuable as the processing pipeline becomes more complex.

So far, we only used row and column indices to subset our tibble. In fact, especially for column selection, it is typically recommended to use the column names, if they are available. This protects us from obtaining unexpected results if (for example) an additional column was added to the tibble, or if the columns were reordered. We can find the column names of the tibble using the colnames() function:

colnames(dat)

[1] "species"           "island"            "bill_length_mm"   
[4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
[7] "sex"               "year"

dat |>
  slice(1:5) |>
  select(species, island, bill_length_mm)

# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

There are many ways of specifying which columns to select - an overview is provided here. For example, the following command selects the species column, plus all columns that start with bill:

dat |> select(species, starts_with("bill"))

# A tibble: 344 × 3
   species bill_length_mm bill_depth_mm
   <chr>            <dbl>         <dbl>
 1 Adelie            39.1          18.7
 2 Adelie            39.5          17.4
 3 Adelie            40.3          18  
 4 Adelie            NA            NA  
 5 Adelie            36.7          19.3
 6 Adelie            39.3          20.6
 7 Adelie            38.9          17.8
 8 Adelie            39.2          19.6
 9 Adelie            34.1          18.1
10 Adelie            42            20.2
# ℹ 334 more rows

It is worth noting that all operations above have generated a new tibble. Sometimes, we want to extract a single column from the data, as a vector. This can be achieved using the pull function:

head(dat |> pull(species))

[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"

Base R outlook

Subsetting can also be done using base R commands. One way to do it is to use square brackets, as follows:

# first row
dat[1, ]

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
# ℹ 2 more variables: sex <chr>, year <dbl>

# first, third and fifth row
dat[c(1, 3, 5), ]

# A tibble: 3 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           40.3          18                 195        3250
3 Adelie  Torge…           36.7          19.3               193        3450
# ℹ 2 more variables: sex <chr>, year <dbl>

# all rows except the first 10
dat[-(1:10), ]

# A tibble: 334 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           37.8          17.1               186
 2 Adelie  Torgersen           37.8          17.3               180
 3 Adelie  Torgersen           41.1          17.6               182
 4 Adelie  Torgersen           38.6          21.2               191
 5 Adelie  Torgersen           34.6          21.1               198
 6 Adelie  Torgersen           36.6          17.8               185
 7 Adelie  Torgersen           38.7          19                 195
 8 Adelie  Torgersen           42.5          20.7               197
 9 Adelie  Torgersen           34.4          18.4               184
10 Adelie  Torgersen           46            21.5               194
# ℹ 324 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

# first column
dat[, 1]

# A tibble: 344 × 1
   species
   <chr>  
 1 Adelie 
 2 Adelie 
 3 Adelie 
 4 Adelie 
 5 Adelie 
 6 Adelie 
 7 Adelie 
 8 Adelie 
 9 Adelie 
10 Adelie 
# ℹ 334 more rows

# all columns except the first 3
dat[, -(1:3)]

# A tibble: 344 × 5
   bill_depth_mm flipper_length_mm body_mass_g sex     year
           <dbl>             <dbl>       <dbl> <chr>  <dbl>
 1          18.7               181        3750 male    2007
 2          17.4               186        3800 female  2007
 3          18                 195        3250 female  2007
 4          NA                  NA          NA <NA>    2007
 5          19.3               193        3450 female  2007
 6          20.6               190        3650 male    2007
 7          17.8               181        3625 female  2007
 8          19.6               195        4675 male    2007
 9          18.1               193        3475 <NA>    2007
10          20.2               190        4250 <NA>    2007
# ℹ 334 more rows

# first five rows, first three columns
dat[1:5, 1:3]

# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

# subset by column name
dat[1:5, c("species", "island", "bill_length_mm")]

# A tibble: 5 × 3
  species island    bill_length_mm
  <chr>   <chr>              <dbl>
1 Adelie  Torgersen           39.1
2 Adelie  Torgersen           39.5
3 Adelie  Torgersen           40.3
4 Adelie  Torgersen           NA  
5 Adelie  Torgersen           36.7

The bracket-based subsetting is an important building block of R, and is applicable to many different types of objects (not just tibbles and data.frames).
An important shortcut to extract a single column is provided by the $ sign:

# extract the first five elements in the 'species' column
dat$species[1:5]

[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"

Warning

There is an important difference between a tibble and a data.frame when it comes to subsetting using the bracket operator. By default, extracting a single column from a data.frame returns a vector (not a data.frame), while performing the same operation on a tibble returns another tibble. The key to getting a data.frame when extracting a column from a data.frame is to add the argument drop = FALSE. The $ operator always returns a vector.

## convert tibble to data.frame
datdf <- as.data.frame(dat)

## bracket subsetting of data.frame -> vector
head(datdf[, "species"]) # vector

[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"

## bracket subsetting of tibble -> tibble
head(dat[, "species"]) # tibble

# A tibble: 6 × 1
  species
  <chr>  
1 Adelie 
2 Adelie 
3 Adelie 
4 Adelie 
5 Adelie 
6 Adelie

## bracket subsetting of data.frame with drop = FALSE -> data.frame
head(datdf[, "species", drop = FALSE]) # data.frame

  species
1  Adelie
2  Adelie
3  Adelie
4  Adelie
5  Adelie
6  Adelie

## select returns object of the same type in both cases
head(dat |> select(species)) # tibble

# A tibble: 6 × 1
  species
  <chr>  
1 Adelie 
2 Adelie 
3 Adelie 
4 Adelie 
5 Adelie 
6 Adelie

head(datdf |> select(species)) # data.frame

  species
1  Adelie
2  Adelie
3  Adelie
4  Adelie
5  Adelie
6  Adelie

## $ returns vector in both cases
head(dat$species) # vector

[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"

head(datdf$species) # vector

[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"

Exercise 1

What is the value of the variable island for the penguin in the 83rd row of dat?

Solution 1

dat[83, "island"]

# A tibble: 1 × 1
  island   
  <chr>    
1 Torgersen

dat |>
  slice(83) |>
  select(island)

# A tibble: 1 × 1
  island   
  <chr>    
1 Torgersen

Exercise 2

Create a new object containing only the data in the last row of dat.

Solution 2

(datlast <- dat[nrow(dat), ])

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Chinst… Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <chr>, year <dbl>

(datlast <- dat |>
    slice(nrow(dat)))

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Chinst… Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <chr>, year <dbl>

Subsetting based on observed values

In the examples above, we selected the rows to keep manually. Another common way to subset data sets is using the observed values. For example, let’s say we only want to keep records from 2007. In the tidyverse, this is achieved by the filter function.

dat |>
  filter(year == 2007)

# A tibble: 110 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 100 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

Implicitly, the filtering here is performed via a logical vector, generated by the logical expression year == 2007. The filter function will look for a column named year in the tibble, and the expression will generate a vector with values that are TRUE for all rows corresponding to records from 2007, and FALSE otherwise. The subsetting is then performed by only returning rows corresponding to TRUE values. We can combine multiple logical expressions. Consider the following illustrative examples:

## "and" -> TRUE where both vectors are TRUE
c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

[1]  TRUE FALSE FALSE

## "or" -> TRUE where at least one vector is TRUE
c(TRUE, FALSE, TRUE) | c(TRUE, TRUE, FALSE)

[1] TRUE TRUE TRUE

For example, to only retain records from 2007 and from the ‘Torgersen’ island:

dat |>
  filter(year == 2007 & island == "Torgersen")

# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

# can also be done in multiple steps
dat |>
  filter(year == 2007) |>
  filter(island == "Torgersen")

# A tibble: 20 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
11 Adelie  Torgersen           37.8          17.1               186
12 Adelie  Torgersen           37.8          17.3               180
13 Adelie  Torgersen           41.1          17.6               182
14 Adelie  Torgersen           38.6          21.2               191
15 Adelie  Torgersen           34.6          21.1               198
16 Adelie  Torgersen           36.6          17.8               185
17 Adelie  Torgersen           38.7          19                 195
18 Adelie  Torgersen           42.5          20.7               197
19 Adelie  Torgersen           34.4          18.4               184
20 Adelie  Torgersen           46            21.5               194
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

A common application of this process is to remove rows with missing values for some column. The function is.na() will return TRUE if a value is missing, and FALSE otherwise. We can start by getting an overview of the number of missing values in each column.

summary(is.na(dat))

  species          island        bill_length_mm  bill_depth_mm  
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:344       FALSE:344       FALSE:342       FALSE:342      
                                 TRUE :2         TRUE :2        
 flipper_length_mm body_mass_g        sex             year        
 Mode :logical     Mode :logical   Mode :logical   Mode :logical  
 FALSE:342         FALSE:342       FALSE:333       FALSE:344      
 TRUE :2           TRUE :2         TRUE :11

Exercise 3

How many records in dat are made in 2007?

Solution 3

dat |>
  filter(year == 2007) |>
  nrow()

[1] 110

Exercise 4

Subset dat to only penguins of species “Adelie” and flipper length above 190 mm. How many such records are there? What is the average body mass for these penguins?

Solution 4

dat_al <- dat |>
  filter(species == "Adelie" & flipper_length_mm > 190)
nrow(dat_al)

[1] 67

dat_al |>
  pull(body_mass_g) |>
  mean(na.rm = TRUE)

[1] 3891.045

Exercise 5

Create a new tibble called dat_no_na by removing the rows with missing values. Check the summary again to make sure that no missing values remain.

Solution 5

dat_no_na <- dat |>
  filter(!is.na(bill_length_mm) & 
           !is.na(bill_depth_mm) & 
           !is.na(flipper_length_mm) & 
           !is.na(body_mass_g) & 
           !is.na(sex))
summary(is.na(dat_no_na))

  species          island        bill_length_mm  bill_depth_mm  
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:333       FALSE:333       FALSE:333       FALSE:333      
 flipper_length_mm body_mass_g        sex             year        
 Mode :logical     Mode :logical   Mode :logical   Mode :logical  
 FALSE:333         FALSE:333       FALSE:333       FALSE:333

# alternative, using a helper function from tidyr
dat_no_na_2 <- dat |>
  drop_na()
summary(is.na(dat_no_na_2))

  species          island        bill_length_mm  bill_depth_mm  
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:333       FALSE:333       FALSE:333       FALSE:333      
 flipper_length_mm body_mass_g        sex             year        
 Mode :logical     Mode :logical   Mode :logical   Mode :logical  
 FALSE:333         FALSE:333       FALSE:333       FALSE:333

## check that objects are equal
all.equal(dat_no_na, dat_no_na_2, check.attributes = FALSE)

[1] TRUE

## another nice way to compare objects
waldo::compare(dat_no_na, dat_no_na_2, ignore_attr = TRUE)

✔ No differences

Exercise 6

Subset dat_no_na to only male penguins of the ‘Adelie’ species. How many such records are there? Calculate the average value of their flipper length.

Solution 6

dat_am <- dat_no_na |>
  filter(species == "Adelie" & sex == "male")
nrow(dat_am)

[1] 73

dat_am |>
  pull(flipper_length_mm) |>
  mean()

[1] 192.411

Exercise 7

Subset dat_no_na to only male penguins of either the Chinstrap or Gentoo species.

Solution 7

dat_no_na |>
  filter(species %in% c("Chinstrap", "Gentoo") & sex == "male")

# A tibble: 95 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>           <dbl>         <dbl>             <dbl>
 1 Gentoo  Biscoe           50            16.3               230
 2 Gentoo  Biscoe           50            15.2               218
 3 Gentoo  Biscoe           47.6          14.5               215
 4 Gentoo  Biscoe           46.7          15.3               219
 5 Gentoo  Biscoe           46.8          15.4               215
 6 Gentoo  Biscoe           49            16.1               216
 7 Gentoo  Biscoe           48.4          14.6               213
 8 Gentoo  Biscoe           49.3          15.7               217
 9 Gentoo  Biscoe           49.2          15.2               221
10 Gentoo  Biscoe           48.7          15.1               222
# ℹ 85 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

## equivalent to
dat_no_na |>
  filter((species == "Chinstrap" | species == "Gentoo") & sex == "male")

# A tibble: 95 × 8
   species island bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>           <dbl>         <dbl>             <dbl>
 1 Gentoo  Biscoe           50            16.3               230
 2 Gentoo  Biscoe           50            15.2               218
 3 Gentoo  Biscoe           47.6          14.5               215
 4 Gentoo  Biscoe           46.7          15.3               219
 5 Gentoo  Biscoe           46.8          15.4               215
 6 Gentoo  Biscoe           49            16.1               216
 7 Gentoo  Biscoe           48.4          14.6               213
 8 Gentoo  Biscoe           49.3          15.7               217
 9 Gentoo  Biscoe           49.2          15.2               221
10 Gentoo  Biscoe           48.7          15.1               222
# ℹ 85 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

Exercise 8

Create a new tibble by keeping the rows with missing values. Do they have something in common?

Solution 8

dat_na <- dat |>
  filter(is.na(bill_length_mm) |
           is.na(bill_depth_mm) | 
           is.na(flipper_length_mm) | 
           is.na(body_mass_g) | 
           is.na(sex))
## There's nothing obvious in common between all these records
dat_na |> as.data.frame()

   species    island bill_length_mm bill_depth_mm flipper_length_mm
1   Adelie Torgersen             NA            NA                NA
2   Adelie Torgersen           34.1          18.1               193
3   Adelie Torgersen           42.0          20.2               190
4   Adelie Torgersen           37.8          17.1               186
5   Adelie Torgersen           37.8          17.3               180
6   Adelie     Dream           37.5          18.9               179
7   Gentoo    Biscoe           44.5          14.3               216
8   Gentoo    Biscoe           46.2          14.4               214
9   Gentoo    Biscoe           47.3          13.8               216
10  Gentoo    Biscoe           44.5          15.7               217
11  Gentoo    Biscoe             NA            NA                NA
   body_mass_g  sex year
1           NA <NA> 2007
2         3475 <NA> 2007
3         4250 <NA> 2007
4         3300 <NA> 2007
5         3700 <NA> 2007
6         2975 <NA> 2007
7         4100 <NA> 2007
8         4650 <NA> 2008
9         4725 <NA> 2009
10        4875 <NA> 2009
11          NA <NA> 2009

Adding a new column to a data set, or replacing an existing column

To add a new column, or to replace an existing column, we can use the mutate function from dplyr. As before, we can use the pipe operator to apply this function to our tibble:

# add a column with all values equal to 1
dat |>
  mutate(new_column = 1)

# A tibble: 344 × 9
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 4 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>,
#   new_column <dbl>

# replace the sex column by the first letter only
dat |>
  mutate(sex = substr(sex, 1, 1))

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

Exercise 9

Add another column to dat that contains the sum of the bill_length_mm and bill_depth_mm columns.

Solution 9

dat <- dat |>
  mutate(lpd = bill_length_mm + bill_depth_mm)

Exercise 10

Create another column in dat that contains the bill length in cm. Display only the species column and the newly created column.

Solution 10

dat |>
  mutate(bill_length_cm = bill_length_mm / 10) |>
  select(species, bill_length_cm)

# A tibble: 344 × 2
   species bill_length_cm
   <chr>            <dbl>
 1 Adelie            3.91
 2 Adelie            3.95
 3 Adelie            4.03
 4 Adelie           NA   
 5 Adelie            3.67
 6 Adelie            3.93
 7 Adelie            3.89
 8 Adelie            3.92
 9 Adelie            3.41
10 Adelie            4.2 
# ℹ 334 more rows

Finding all unique combinations

The distinct() function returns all distinct rows in a data.frame or tibble. We can also supply a set of columns, and dplyr will first select those columns.

dat |>
  distinct(species, island)

# A tibble: 5 × 2
  species   island   
  <chr>     <chr>    
1 Adelie    Torgersen
2 Adelie    Biscoe   
3 Adelie    Dream    
4 Gentoo    Biscoe   
5 Chinstrap Dream

We can also count the number of observations for each of the distinct combinations.

dat |>
  count(species, island)

# A tibble: 5 × 3
  species   island        n
  <chr>     <chr>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

Exercise 11

Verify using filter that there are no records for Chinstrap penguins on the Biscoe island.

Solution 11

dat |>
  filter(species == "Chinstrap" & island == "Biscoe")

# A tibble: 0 × 9
# ℹ 9 variables: species <chr>, island <chr>, bill_length_mm <dbl>,
#   bill_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
#   sex <chr>, year <dbl>, lpd <dbl>

More information

dplyr cheatsheet
Data transformation chapter of ‘R for data science’

Session info

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4     
 [5] purrr_1.0.2      readr_2.1.5      tidyr_1.3.1      tibble_3.2.1    
 [9] ggplot2_3.5.1    tidyverse_2.0.0  BiocStyle_2.32.1 png_0.1-8       
[13] knitr_1.48      

loaded via a namespace (and not attached):
 [1] utf8_1.2.4          generics_0.1.3      stringi_1.8.4      
 [4] hms_1.1.3           digest_0.6.37       magrittr_2.0.3     
 [7] evaluate_1.0.0      timechange_0.3.0    fastmap_1.2.0      
[10] jsonlite_1.8.9      BiocManager_1.30.25 fansi_1.0.6        
[13] scales_1.3.0        cli_3.6.3           rlang_1.1.4        
[16] crayon_1.5.3        bit64_4.0.5         munsell_0.5.1      
[19] withr_3.0.1         yaml_2.3.10         tools_4.4.1        
[22] parallel_4.4.1      tzdb_0.4.0          colorspace_2.1-1   
[25] vctrs_0.6.5         R6_2.5.1            lifecycle_1.0.4    
[28] htmlwidgets_1.6.4   bit_4.5.0           vroom_1.6.5        
[31] pkgconfig_2.0.3     waldo_0.5.3         archive_1.1.9      
[34] pillar_1.9.0        gtable_0.3.5        glue_1.7.0         
[37] xfun_0.47           tidyselect_1.2.1    rstudioapi_0.16.0  
[40] htmltools_0.5.8.1   rmarkdown_2.28      compiler_4.4.1