In the previous lecture we saw how to read data from a text file into R, where it would be represented in a tibble or data.frame. In this lecture, we will see how to work with the resulting object, and explore its content.
We will mostly use functions from the tidyverse, which provides a modern, widely used collection of R packages designed for data science. For some aspects, we will also illustrate how to perform the same operations using “base R” commands, that is, functions that are loaded with R by default rather than provided in a separate add-on package.
Exercise 0
Create a script (a regular *.R script, an *.Rmd R Markdown file, or a *.qmd Quarto file) that will be used to save all the code that we will write today.
Load the tidyverse package
In order to make the functionality of the tidyverse available in our R session, we first have to load the packages. Here, we will load the “meta-package” called tidyverse, which in turn will load a collection of packages providing different functionality for data analysis.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load data
Next, we read the data that we will be using. Like last week, we will use the Palmer Penguins data set for our illustrations. We will assume that you have downloaded the file into a subfolder named data/ - if not, the following code from last week will download the file:
# first we need to create a `data` directorydir.create("data")# then we can download the filedownload.file(url ="https://ivanek.github.io/introductionToR/data/penguins.csv",destfile ="data/penguins.csv")
You can check the content of the data/ subfolder of the current working directory with the command list.files("data") (there should be a penguins.csv in the list).
This data set contains records for 344 penguins from three different species, collected from 3 islands in the Palmer Archipelago, Antarctica.
dat <- readr::read_csv("data/penguins.csv")
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(dat)
spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ species : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
$ sex : chr [1:344] "male" "female" "female" NA ...
$ year : num [1:344] 2007 2007 2007 2007 2007 ...
- attr(*, "spec")=
.. cols(
.. species = col_character(),
.. island = col_character(),
.. bill_length_mm = col_double(),
.. bill_depth_mm = col_double(),
.. flipper_length_mm = col_double(),
.. body_mass_g = col_double(),
.. sex = col_character(),
.. year = col_double()
.. )
- attr(*, "problems")=<externalptr>
Subsetting
A common operation is to subset data sets, that is, to only retain some of the records. The specification of what to retain can be done in many different ways. For example, we may know which row numbers we want to keep, or we may only want to retain rows where one of the measured variables have a certain value (e.g., all records from 2007). As is usually the case with R, there are many ways of achieving the desired result, and we will consider a few different options.
To retain a pre-specified set of rows, we can use the slice function from dplyr:
# first rowslice(dat, 1)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
# ℹ 2 more variables: sex <chr>, year <dbl>
# first, third and fifth rowslice(dat, c(1, 3, 5))
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 40.3 18 195 3250
3 Adelie Torge… 36.7 19.3 193 3450
# ℹ 2 more variables: sex <chr>, year <dbl>
# all columns except the first 3select(dat, -(1:3))
# A tibble: 344 × 5
bill_depth_mm flipper_length_mm body_mass_g sex year
<dbl> <dbl> <dbl> <chr> <dbl>
1 18.7 181 3750 male 2007
2 17.4 186 3800 female 2007
3 18 195 3250 female 2007
4 NA NA NA <NA> 2007
5 19.3 193 3450 female 2007
6 20.6 190 3650 male 2007
7 17.8 181 3625 female 2007
8 19.6 195 4675 male 2007
9 18.1 193 3475 <NA> 2007
10 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
To subset by both row and column, we need to combine these operations. This can be done in several ways. The one that first may come to mind would be to first apply one of the operations (e.g., subset by row), store the intermediate result in a new variable, and then apply the column selection.
temp <-slice(dat, 1:5)select(temp, 1:3)
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
We could also write the expression above using a nested function call, without the intermediate variable, but this is already getting difficult to read!
select(slice(dat, 1:5), 1:3)
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
A more intuitive way, which reduces the number of intermediate variables that are explicitly created and scales to more complex operations, is to use the pipe operator (|>).
Side note
Perhaps you have seen the pipe operator written as %>%. This pipe operator comes from the magrittr package, while the pipe operator we use here (|>) is provided with R itself. In most aspects, the two are equivalent.
The pipe operator can be read as “then”, and allows you to take the output from a command and send it directly into another command. The assumption (if not otherwise specified) is that the object that is being sent forward (e.g., dat) is the first argument of the function it is being sent into. This is the case for the slice and select functions that we used above - the first argument is a tibble or a data.frame (check ?slice). Moreover, the output of each of these functions is another object of the same type as the input (e.g. tibble or data.frame), which can then be sent further into the next function operating on this type of object. Let’s see how this works with a few examples. First, the following formulations are equivalent ways of selecting the first five rows of dat:
slice(dat, 1:5)
# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 39.5 17.4 186 3800
3 Adelie Torge… 40.3 18 195 3250
4 Adelie Torge… NA NA NA NA
5 Adelie Torge… 36.7 19.3 193 3450
# ℹ 2 more variables: sex <chr>, year <dbl>
dat |>slice(1:5)
# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 39.5 17.4 186 3800
3 Adelie Torge… 40.3 18 195 3250
4 Adelie Torge… NA NA NA NA
5 Adelie Torge… 36.7 19.3 193 3450
# ℹ 2 more variables: sex <chr>, year <dbl>
## We can be explicit about where the piped object (represented by `_`) goesdat |>slice(.data = _, 1:5)
# A tibble: 5 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 39.5 17.4 186 3800
3 Adelie Torge… 40.3 18 195 3250
4 Adelie Torge… NA NA NA NA
5 Adelie Torge… 36.7 19.3 193 3450
# ℹ 2 more variables: sex <chr>, year <dbl>
Next, let’s use the pipe to subset our tibble to the first five rows and the first three columns:
dat |>slice(1:5) |>select(1:3)
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
This can be read as “take the dat object, then slice it to retain the first five rows, then select the first three columns”. While this may, perhaps, seem like an overly complicated way of subsetting a tibble, the ability to split up a procedure into smaller steps and chain them together in a logical and comprehensible way, can be very valuable as the processing pipeline becomes more complex.
So far, we only used row and column indices to subset our tibble. In fact, especially for column selection, it is typically recommended to use the column names, if they are available. This protects us from obtaining unexpected results if (for example) an additional column was added to the tibble, or if the columns were reordered. We can find the column names of the tibble using the colnames() function:
dat |>slice(1:5) |>select(species, island, bill_length_mm)
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
There are many ways of specifying which columns to select - an overview is provided here. For example, the following command selects the species column, plus all columns that start with bill:
It is worth noting that all operations above have generated a new tibble. Sometimes, we want to extract a single column from the data, as a vector. This can be achieved using the pull function:
# A tibble: 344 × 5
bill_depth_mm flipper_length_mm body_mass_g sex year
<dbl> <dbl> <dbl> <chr> <dbl>
1 18.7 181 3750 male 2007
2 17.4 186 3800 female 2007
3 18 195 3250 female 2007
4 NA NA NA <NA> 2007
5 19.3 193 3450 female 2007
6 20.6 190 3650 male 2007
7 17.8 181 3625 female 2007
8 19.6 195 4675 male 2007
9 18.1 193 3475 <NA> 2007
10 20.2 190 4250 <NA> 2007
# ℹ 334 more rows
# first five rows, first three columnsdat[1:5, 1:3]
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
# subset by column namedat[1:5, c("species", "island", "bill_length_mm")]
# A tibble: 5 × 3
species island bill_length_mm
<chr> <chr> <dbl>
1 Adelie Torgersen 39.1
2 Adelie Torgersen 39.5
3 Adelie Torgersen 40.3
4 Adelie Torgersen NA
5 Adelie Torgersen 36.7
The bracket-based subsetting is an important building block of R, and is applicable to many different types of objects (not just tibbles and data.frames).
An important shortcut to extract a single column is provided by the $ sign:
# extract the first five elements in the 'species' columndat$species[1:5]
[1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
Warning
There is an important difference between a tibble and a data.frame when it comes to subsetting using the bracket operator. By default, extracting a single column from a data.frame returns a vector (not a data.frame), while performing the same operation on a tibble returns another tibble. The key to getting a data.frame when extracting a column from a data.frame is to add the argument drop = FALSE. The $ operator always returns a vector.
## convert tibble to data.framedatdf <-as.data.frame(dat)## bracket subsetting of data.frame -> vectorhead(datdf[, "species"]) # vector
What is the value of the variable island for the penguin in the 83rd row of dat?
Solution 1
dat[83, "island"]
# A tibble: 1 × 1
island
<chr>
1 Torgersen
dat |>slice(83) |>select(island)
# A tibble: 1 × 1
island
<chr>
1 Torgersen
Exercise 2
Create a new object containing only the data in the last row of dat.
Solution 2
(datlast <- dat[nrow(dat), ])
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinst… Dream 50.2 18.7 198 3775
# ℹ 2 more variables: sex <chr>, year <dbl>
(datlast <- dat |>slice(nrow(dat)))
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinst… Dream 50.2 18.7 198 3775
# ℹ 2 more variables: sex <chr>, year <dbl>
Subsetting based on observed values
In the examples above, we selected the rows to keep manually. Another common way to subset data sets is using the observed values. For example, let’s say we only want to keep records from 2007. In the tidyverse, this is achieved by the filter function.
dat |>filter(year ==2007)
# A tibble: 110 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 100 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
Implicitly, the filtering here is performed via a logical vector, generated by the logical expression year == 2007. The filter function will look for a column named year in the tibble, and the expression will generate a vector with values that are TRUE for all rows corresponding to records from 2007, and FALSE otherwise. The subsetting is then performed by only returning rows corresponding to TRUE values. We can combine multiple logical expressions. Consider the following illustrative examples:
## "and" -> TRUE where both vectors are TRUEc(TRUE, FALSE, TRUE) &c(TRUE, TRUE, FALSE)
[1] TRUE FALSE FALSE
## "or" -> TRUE where at least one vector is TRUEc(TRUE, FALSE, TRUE) |c(TRUE, TRUE, FALSE)
[1] TRUE TRUE TRUE
For example, to only retain records from 2007 and from the ‘Torgersen’ island:
A common application of this process is to remove rows with missing values for some column. The function is.na() will return TRUE if a value is missing, and FALSE otherwise. We can start by getting an overview of the number of missing values in each column.
Subset dat to only penguins of species “Adelie” and flipper length above 190 mm. How many such records are there? What is the average body mass for these penguins?
Solution 4
dat_al <- dat |>filter(species =="Adelie"& flipper_length_mm >190)nrow(dat_al)
[1] 67
dat_al |>pull(body_mass_g) |>mean(na.rm =TRUE)
[1] 3891.045
Exercise 5
Create a new tibble called dat_no_na by removing the rows with missing values. Check the summary again to make sure that no missing values remain.
Solution 5
dat_no_na <- dat |>filter(!is.na(bill_length_mm) &!is.na(bill_depth_mm) &!is.na(flipper_length_mm) &!is.na(body_mass_g) &!is.na(sex))summary(is.na(dat_no_na))
species island bill_length_mm bill_depth_mm
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:333 FALSE:333 FALSE:333 FALSE:333
flipper_length_mm body_mass_g sex year
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:333 FALSE:333 FALSE:333 FALSE:333
# alternative, using a helper function from tidyrdat_no_na_2 <- dat |>drop_na()summary(is.na(dat_no_na_2))
species island bill_length_mm bill_depth_mm
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:333 FALSE:333 FALSE:333 FALSE:333
flipper_length_mm body_mass_g sex year
Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:333 FALSE:333 FALSE:333 FALSE:333
## check that objects are equalall.equal(dat_no_na, dat_no_na_2, check.attributes =FALSE)
[1] TRUE
## another nice way to compare objectswaldo::compare(dat_no_na, dat_no_na_2, ignore_attr =TRUE)
✔ No differences
Exercise 6
Subset dat_no_na to only male penguins of the ‘Adelie’ species. How many such records are there? Calculate the average value of their flipper length.
Solution 6
dat_am <- dat_no_na |>filter(species =="Adelie"& sex =="male")nrow(dat_am)
[1] 73
dat_am |>pull(flipper_length_mm) |>mean()
[1] 192.411
Exercise 7
Subset dat_no_na to only male penguins of either the Chinstrap or Gentoo species.
Solution 7
dat_no_na |>filter(species %in%c("Chinstrap", "Gentoo") & sex =="male")
Create a new tibble by keeping the rows with missing values. Do they have something in common?
Solution 8
dat_na <- dat |>filter(is.na(bill_length_mm) |is.na(bill_depth_mm) |is.na(flipper_length_mm) |is.na(body_mass_g) |is.na(sex))## There's nothing obvious in common between all these recordsdat_na |>as.data.frame()
species island bill_length_mm bill_depth_mm flipper_length_mm
1 Adelie Torgersen NA NA NA
2 Adelie Torgersen 34.1 18.1 193
3 Adelie Torgersen 42.0 20.2 190
4 Adelie Torgersen 37.8 17.1 186
5 Adelie Torgersen 37.8 17.3 180
6 Adelie Dream 37.5 18.9 179
7 Gentoo Biscoe 44.5 14.3 216
8 Gentoo Biscoe 46.2 14.4 214
9 Gentoo Biscoe 47.3 13.8 216
10 Gentoo Biscoe 44.5 15.7 217
11 Gentoo Biscoe NA NA NA
body_mass_g sex year
1 NA <NA> 2007
2 3475 <NA> 2007
3 4250 <NA> 2007
4 3300 <NA> 2007
5 3700 <NA> 2007
6 2975 <NA> 2007
7 4100 <NA> 2007
8 4650 <NA> 2008
9 4725 <NA> 2009
10 4875 <NA> 2009
11 NA <NA> 2009
Adding a new column to a data set, or replacing an existing column
To add a new column, or to replace an existing column, we can use the mutate function from dplyr. As before, we can use the pipe operator to apply this function to our tibble:
# add a column with all values equal to 1dat |>mutate(new_column =1)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 4 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>,
# new_column <dbl>
# replace the sex column by the first letter onlydat |>mutate(sex =substr(sex, 1, 1))
# A tibble: 344 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
Exercise 9
Add another column to dat that contains the sum of the bill_length_mm and bill_depth_mm columns.
Solution 9
dat <- dat |>mutate(lpd = bill_length_mm + bill_depth_mm)
Exercise 10
Create another column in dat that contains the bill length in cm. Display only the species column and the newly created column.
Solution 10
dat |>mutate(bill_length_cm = bill_length_mm /10) |>select(species, bill_length_cm)
The distinct() function returns all distinct rows in a data.frame or tibble. We can also supply a set of columns, and dplyr will first select those columns.
dat |>distinct(species, island)
# A tibble: 5 × 2
species island
<chr> <chr>
1 Adelie Torgersen
2 Adelie Biscoe
3 Adelie Dream
4 Gentoo Biscoe
5 Chinstrap Dream
We can also count the number of observations for each of the distinct combinations.
dat |>count(species, island)
# A tibble: 5 × 3
species island n
<chr> <chr> <int>
1 Adelie Biscoe 44
2 Adelie Dream 56
3 Adelie Torgersen 52
4 Chinstrap Dream 68
5 Gentoo Biscoe 124
Exercise 11
Verify using filter that there are no records for Chinstrap penguins on the Biscoe island.
Solution 11
dat |>filter(species =="Chinstrap"& island =="Biscoe")
# A tibble: 0 × 9
# ℹ 9 variables: species <chr>, island <chr>, bill_length_mm <dbl>,
# bill_depth_mm <dbl>, flipper_length_mm <dbl>, body_mass_g <dbl>,
# sex <chr>, year <dbl>, lpd <dbl>