In this lecture we continue using the tidyverse for exploratory data analysis. We focus on arranging data sets in a particular order and on calculating summary statistics for subsets of the data.
Exercise
Create a script (a regular .R script, an .Rmd R Markdown file, or a .qmd Quarto file) to save all the code that we will write today.
We first load tidyverse, which is a package bundle that will load several specialized packages for data analysis.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We use the Palmer Penguins data set from the previous lectures. Assuming the data file is located in the data sub-directory of your current working directory (data/penguins.csv), we can load it with:
dat <- readr::read_csv("data/penguins.csv")
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Short recap
The previous lecture introduced several commands to subset and modify the data. We briefly recap some of them with examples.
Example 1
What data types are in our data set?
str(dat)
spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ species : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
$ island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
$ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
$ sex : chr [1:344] "male" "female" "female" NA ...
$ year : num [1:344] 2007 2007 2007 2007 2007 ...
- attr(*, "spec")=
.. cols(
.. species = col_character(),
.. island = col_character(),
.. bill_length_mm = col_double(),
.. bill_depth_mm = col_double(),
.. flipper_length_mm = col_double(),
.. body_mass_g = col_double(),
.. sex = col_character(),
.. year = col_double()
.. )
- attr(*, "problems")=<externalptr>
Example 2
Over what period of time are penguins observed?
years <- select(dat, year)
distinct(years)
# A tibble: 3 × 1
year
<dbl>
1 2007
2 2008
3 2009
or more conveniently using the pipe-operator (|>):
dat |> select(year) |> distinct()
# A tibble: 3 × 1
year
<dbl>
1 2007
2 2008
3 2009
Example 3
Select the first 3 cases observed in 2007.
dat |> filter(year == 2007) |> slice(1:3)
# A tibble: 3 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 39.5 17.4 186 3800
3 Adelie Torge… 40.3 18 195 3250
# ℹ 2 more variables: sex <chr>, year <dbl>
Example 4
Which animals are “exceptionally” big? We call an animal exceptional if its body mass exceeds the mean body mass plus two times the standard deviation:
dat |>
  drop_na(body_mass_g) |>
  mutate(big = body_mass_g > mean(body_mass_g) + 2 * sd(body_mass_g)) |>
  filter(big)
Example 5
How many animals were observed per island and species?
dat |> count(island, species)
# A tibble: 5 × 3
island species n
<chr> <chr> <int>
1 Biscoe Adelie 44
2 Biscoe Gentoo 124
3 Dream Adelie 56
4 Dream Chinstrap 68
5 Torgersen Adelie 52
The count() example illustrates a more general principle: apply summaries to data subsets defined by one or more grouping variables, in this case island and species. You will hear more on this common task below.
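As a preview of that principle, the sketch below suggests that count() gives the same result as grouping the data and summarizing each group with n(). The tibble toy and its values are made up for illustration and are not part of the penguin data.

```r
library(dplyr)

# Hypothetical mini data set (not the penguin data)
toy <- tibble(
  island  = c("X", "X", "X", "Y"),
  species = c("A", "A", "B", "A")
)

# Shortcut: count rows per island-species combination
via_count <- toy |> count(island, species)

# The same summary written out: group first, then count rows per group
via_groups <- toy |>
  group_by(island, species) |>
  summarize(n = n(), .groups = "drop")

via_count
via_groups
```

Both pipelines return one row per observed combination together with its row count.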
Order data sets with arrange()
The arrange() command orders the rows of our tibble according to one or more selected variables. Let's define a small data summary in order to highlight a few use cases for arrange().
sub_dat <- dat |> drop_na(sex) |> count(island, species, sex)
sub_dat
# A tibble: 10 × 4
island species sex n
<chr> <chr> <chr> <int>
1 Biscoe Adelie female 22
2 Biscoe Adelie male 22
3 Biscoe Gentoo female 58
4 Biscoe Gentoo male 61
5 Dream Adelie female 27
6 Dream Adelie male 28
7 Dream Chinstrap female 34
8 Dream Chinstrap male 34
9 Torgersen Adelie female 24
10 Torgersen Adelie male 23
We can arrange the data by animal count (n) with:
sub_dat |> arrange(n)
# A tibble: 10 × 4
island species sex n
<chr> <chr> <chr> <int>
1 Biscoe Adelie female 22
2 Biscoe Adelie male 22
3 Torgersen Adelie male 23
4 Torgersen Adelie female 24
5 Dream Adelie female 27
6 Dream Adelie male 28
7 Dream Chinstrap female 34
8 Dream Chinstrap male 34
9 Biscoe Gentoo female 58
10 Biscoe Gentoo male 61
By default numerical variables are sorted from smallest to largest value. Sorting in descending order is achieved by wrapping the sort variable with desc():
sub_dat |> arrange(desc(n))
# A tibble: 10 × 4
island species sex n
<chr> <chr> <chr> <int>
1 Biscoe Gentoo male 61
2 Biscoe Gentoo female 58
3 Dream Chinstrap female 34
4 Dream Chinstrap male 34
5 Dream Adelie male 28
6 Dream Adelie female 27
7 Torgersen Adelie female 24
8 Torgersen Adelie male 23
9 Biscoe Adelie female 22
10 Biscoe Adelie male 22
It's possible to arrange the data by several variables at once:
sub_dat |> arrange(island, species, desc(n))
# A tibble: 10 × 4
island species sex n
<chr> <chr> <chr> <int>
1 Biscoe Adelie female 22
2 Biscoe Adelie male 22
3 Biscoe Gentoo male 61
4 Biscoe Gentoo female 58
5 Dream Adelie male 28
6 Dream Adelie female 27
7 Dream Chinstrap female 34
8 Dream Chinstrap male 34
9 Torgersen Adelie female 24
10 Torgersen Adelie male 23
In the example, the data is first sorted alphabetically on island, within the same island on species, and within each island-species combination on count in descending order.
Sometimes we would like to arrange categorical data in a particular order. Categorical variables are best represented as factors in R, where the order of the categories (= levels) can be freely chosen. This allows us to arrange the data in the order of the categories.
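A minimal sketch of such a reordering, assuming the island levels should run Dream, Torgersen, Biscoe and that the rows are then sorted by species and island. The counts are re-entered by hand from the sub_dat summary above so that the block runs without the data file; island_order is a name chosen here for illustration.

```r
library(dplyr)

# Counts copied from the sub_dat summary shown earlier
sub_dat <- tibble(
  island  = rep(c("Biscoe", "Dream", "Torgersen"), times = c(4, 4, 2)),
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Adelie", "Chinstrap", "Chinstrap",
              "Adelie", "Adelie"),
  sex     = rep(c("female", "male"), times = 5),
  n       = c(22L, 22L, 58L, 61L, 27L, 28L, 34L, 34L, 24L, 23L)
)

# Assumed level order (any other order could be chosen)
island_order <- c("Dream", "Torgersen", "Biscoe")

sorted <- sub_dat |>
  mutate(island = factor(island, levels = island_order)) |>  # character -> factor
  arrange(species, island)                                    # island sorts by level order

sorted
```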
# A tibble: 10 × 4
island species sex n
<fct> <chr> <chr> <int>
1 Dream Adelie female 27
2 Dream Adelie male 28
3 Torgersen Adelie female 24
4 Torgersen Adelie male 23
5 Biscoe Adelie female 22
6 Biscoe Adelie male 22
7 Dream Chinstrap female 34
8 Dream Chinstrap male 34
9 Biscoe Gentoo female 58
10 Biscoe Gentoo male 61
Exercise
Use arrange() and slice() to get the male animal from Biscoe island with the largest flipper length.
dat |> arrange(island, desc(sex), desc(flipper_length_mm)) |> slice(1)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gentoo Biscoe 54.3 15.7 231 5650
# ℹ 2 more variables: sex <chr>, year <dbl>
More on slicing data
The slice() command selects rows by index, but there are more variants of this command. First, slice_head() and slice_tail() let you select the first or last cases of the data, respectively. Here we select the first two and the last two:
dat |> slice_head(n = 2)
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
2 Adelie Torge… 39.5 17.4 186 3800
# ℹ 2 more variables: sex <chr>, year <dbl>
dat |> slice_tail(n = 2)
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinst… Dream 50.8 19 210 4100
2 Chinst… Dream 50.2 18.7 198 3775
# ℹ 2 more variables: sex <chr>, year <dbl>
Two more slice variants aim at selecting rows with the smallest or largest value for a given variable. Here we select cases with the smallest and the largest bill length, respectively.
dat |> slice_min(bill_length_mm)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Dream 32.1 15.5 188 3050
# ℹ 2 more variables: sex <chr>, year <dbl>
dat |> slice_max(bill_length_mm)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gentoo Biscoe 59.6 17 230 6050
# ℹ 2 more variables: sex <chr>, year <dbl>
These versions of slice are similar to a combination of arrange() and slice_head() or slice_tail().
dat |> arrange(desc(bill_length_mm)) |> slice_head()
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gentoo Biscoe 59.6 17 230 6050
# ℹ 2 more variables: sex <chr>, year <dbl>
However, be careful when the ordering variable contains missing values (NA): arrange() places them at the end, so slice_tail() returns a row with NA rather than the largest value.
dat |> arrange(bill_length_mm) |> slice_tail()
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gentoo Biscoe NA NA NA NA
# ℹ 2 more variables: sex <chr>, year <dbl>
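The sketch below illustrates this behaviour on a made-up tibble (toy and its values are hypothetical). It suggests that missing values end up last in both ascending and descending order, and shows two ways to get the largest non-missing value.

```r
library(dplyr)

# Hypothetical data with one missing value
toy <- tibble(id = 1:4, x = c(3, NA, 1, 2))

asc <- toy |> arrange(x)        # x: 1, 2, 3, NA
dsc <- toy |> arrange(desc(x))  # x: 3, 2, 1, NA (NA is still last)

# Option 1: drop missing values before taking the last row
largest_a <- toy |>
  filter(!is.na(x)) |>
  arrange(x) |>
  slice_tail(n = 1)

# Option 2: slice_max() ranks the non-missing values first
largest_b <- toy |> slice_max(x, n = 1)

largest_a
largest_b
```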
Exercise
Explain why these two expressions return different results!
dat |> arrange(year) |> slice_head()
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torge… 39.1 18.7 181 3750
# ℹ 2 more variables: sex <chr>, year <dbl>
dat |> slice_min(year)
# A tibble: 110 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 100 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
A: slice_min() and slice_max() may return more rows than requested in the presence of ties.
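If exactly one row is wanted despite ties, the with_ties argument of slice_min() and slice_max() can be set to FALSE. A minimal sketch on made-up data (toy is hypothetical):

```r
library(dplyr)

# Hypothetical data: the smallest year (2007) occurs three times
toy <- tibble(id = 1:5, year = c(2008, 2007, 2007, 2009, 2007))

all_ties <- toy |> slice_min(year)                     # keeps every tied row
one_row  <- toy |> slice_min(year, with_ties = FALSE)  # keeps a single row

nrow(all_ties)
nrow(one_row)
```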
Exercise
From the Adelie penguins select the case(s) with the smallest body mass. How many are there?
dat |> filter(species == "Adelie") |> slice_min(body_mass_g)
# A tibble: 2 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Biscoe 36.5 16.6 181 2850
2 Adelie Biscoe 36.4 17.1 184 2850
# ℹ 2 more variables: sex <chr>, year <dbl>
Split-apply-combine data analysis
Often we would like to perform operations on subsets of the data, e.g., calculate the mean body mass per species. This can be done by splitting the data into subsets as defined by a grouping variable (species), calculating the summary for each subset (mean body mass), and combining the results into a new data set.
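To make the three steps concrete, the sketch below carries them out by hand with base R and then compares the result with a grouped dplyr summary. The tibble toy, its columns and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical data: two species with a few body masses each
toy <- tibble(
  species = c("A", "A", "B", "B", "B"),
  mass    = c(10, 20, 30, 40, 50)
)

# Split: one data frame per species
parts <- split(toy, toy$species)

# Apply: mean body mass within each piece
means <- sapply(parts, function(d) mean(d$mass))

# Combine: collect the per-species results in a new tibble
by_hand <- tibble(species = names(means), mean_mass = unname(means))

# The same three steps expressed with dplyr
by_dplyr <- toy |>
  group_by(species) |>
  summarize(mean_mass = mean(mass))

by_hand
by_dplyr
```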
Define data subsets with group_by()
At the base of this “split-apply-combine” strategy is the grouping of our data which works via the group_by() command:
dat |> group_by(species)
# A tibble: 344 × 8
# Groups: species [3]
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
group_by() doesn’t perform any computation. It just returns the data grouped into the subsets defined by the grouping variable, in our case species. We can also group our data by several variables, e.g. species and island:
dat |> group_by(species, island)
# A tibble: 344 × 8
# Groups: species, island [5]
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
Note that there are 5 groups in total, i.e., the combination of island and species led to 5 distinct subsets. Combinations that don’t appear in the data, such as Gentoo penguins on Torgersen island, are dropped by default.
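The following sketch illustrates this on a made-up tibble (toy is hypothetical): two species and two islands allow four combinations, but only the three that occur in the data become groups.

```r
library(dplyr)

# Hypothetical data: the combination Gentoo/Dream never occurs
toy <- tibble(
  species = c("Adelie", "Adelie", "Gentoo", "Adelie"),
  island  = c("Biscoe", "Dream", "Biscoe", "Biscoe")
)

grouped <- toy |> group_by(species, island)

n_groups(grouped)    # number of groups actually formed
group_size(grouped)  # rows per group, in group order
```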
Exercise
Group the data by species and island. What do the following functions tell you about the data grouping?
group_keys()
group_indices()
grp_dat <- dat |> group_by(species, island)
grp_dat |> group_keys()
# A tibble: 5 × 2
species island
<chr> <chr>
1 Adelie Biscoe
2 Adelie Dream
3 Adelie Torgersen
4 Chinstrap Dream
5 Gentoo Biscoe
Exercise
What happens if you apply group_by() to data that are already grouped?
grp_dat |> group_by(island)
# A tibble: 344 × 8
# Groups: island [3]
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
A: By default any existing grouping will be overwritten by a new group_by() command.
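If the existing grouping should be extended rather than replaced, group_by() has an .add argument. A minimal sketch on made-up data (toy is hypothetical), using group_vars() to list the grouping variables:

```r
library(dplyr)

# Hypothetical data
toy <- tibble(
  species = c("A", "A", "B"),
  island  = c("X", "Y", "X")
)

by_species <- toy |> group_by(species)

# Default: the new grouping replaces the old one
replaced <- by_species |> group_by(island)

# .add = TRUE: the new variable is added to the existing grouping
added <- by_species |> group_by(island, .add = TRUE)

group_vars(replaced)
group_vars(added)
```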
Operations on data subsets
The important difference between a grouped and an un-grouped tibble is that subsequent operations are applied to each group (when applicable) instead of to the whole data set. Compare the results of these two pipelines, first un-grouped, then grouped:
dat |> slice_min(body_mass_g)
# A tibble: 1 × 8
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Chinst… Dream 46.9 16.6 192 2700
# ℹ 2 more variables: sex <chr>, year <dbl>
dat |> group_by(species, island) |> slice_min(body_mass_g)
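The effect can be seen on a small made-up tibble (toy is hypothetical): without grouping, slice_min() returns the overall minimum, while on grouped data it returns the minimum within each group.

```r
library(dplyr)

# Hypothetical data: two species with two body masses each
toy <- tibble(
  species = c("A", "A", "B", "B"),
  mass    = c(10, 20, 5, 30)
)

# Un-grouped: the single lightest animal overall
overall <- toy |> slice_min(mass)

# Grouped: the lightest animal within each species
per_species <- toy |>
  group_by(species) |>
  slice_min(mass)

overall
per_species
```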
You can unset the grouping with ungroup() which can be useful if you want to perform un-grouped operations afterwards. In the following example we use the n() command, which gives you the current group size, to count the number of animals per island. We then un-group the data to calculate the total number of animals (=rows).
animal_per_island <- dat |>
  group_by(island) |>
  mutate(n_island = n()) |>
  ungroup() |>
  mutate(n_total = n())
animal_per_island
# A tibble: 344 × 10
species island bill_length_mm bill_depth_mm flipper_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 181
2 Adelie Torgersen 39.5 17.4 186
3 Adelie Torgersen 40.3 18 195
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 193
6 Adelie Torgersen 39.3 20.6 190
7 Adelie Torgersen 38.9 17.8 181
8 Adelie Torgersen 39.2 19.6 195
9 Adelie Torgersen 34.1 18.1 193
10 Adelie Torgersen 42 20.2 190
# ℹ 334 more rows
# ℹ 5 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>,
# n_island <int>, n_total <int>
Adding a selection of distinct values for animal counts per island returns a more meaningful summary:
# A tibble: 3 × 3
species avg mean
<chr> <dbl> <dbl>
1 Adelie 3676. 3701.
2 Chinstrap 3733. 3733.
3 Gentoo 5035. 5076.
Exercise
Calculate the range of body mass (= difference between maximum and minimum) per species and sex. Again, we neglect all cases where sex is undetermined.
dat |>
  drop_na(sex) |>
  group_by(species, sex) |>
  summarize(
    min = min(body_mass_g),
    max = max(body_mass_g),
    range = max - min
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 6 × 5
# Groups: species [3]
species sex min max range
<chr> <chr> <dbl> <dbl> <dbl>
1 Adelie female 2850 3900 1050
2 Adelie male 3325 4775 1450
3 Chinstrap female 2700 4150 1450
4 Chinstrap male 3250 4800 1550
5 Gentoo female 3950 5200 1250
6 Gentoo male 4750 6300 1550
Exercise
Similar to the example at the beginning, we call an animal exceptionally big if its body mass exceeds the mean body mass plus two times the standard deviation. However, this time the mean and standard deviation are computed per sex and species. Use a combination of group_by(), mutate() and filter() to get the biggest animals. Also, exclude all animals with undetermined sex.
dat |>
  drop_na(sex) |>
  group_by(species, sex) |>
  mutate(big = body_mass_g > mean(body_mass_g) + 2 * sd(body_mass_g)) |>
  filter(big)
# A tibble: 4 × 9
# Groups: species, sex [4]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Adelie Biscoe 43.2 19 197 4775
2 Gentoo Biscoe 49.2 15.2 221 6300
3 Chinst… Dream 46 18.9 195 4150
4 Chinst… Dream 52 20.7 210 4800
# ℹ 3 more variables: sex <chr>, year <dbl>, big <lgl>
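The sketch below contrasts the global and the per-group version of this threshold on made-up numbers. The tibble toy and the helper flag_big() are hypothetical names introduced only for this illustration.

```r
library(dplyr)

# Hypothetical data: a light group with one heavy member and a heavy group
toy <- tibble(
  group = rep(c("small", "large"), each = 6),
  mass  = c(10, 10, 10, 10, 10, 19, 98, 99, 100, 100, 101, 102)
)

# Hypothetical helper: TRUE where a value exceeds mean + 2 * sd of its input
flag_big <- function(x) x > mean(x) + 2 * sd(x)

# Global threshold: mean and sd are computed over all rows
global <- toy |>
  mutate(big = flag_big(mass)) |>
  filter(big)

# Per-group threshold: mean and sd are computed within each group
per_group <- toy |>
  group_by(group) |>
  mutate(big = flag_big(mass)) |>
  filter(big) |>
  ungroup()

global
per_group
```

In this toy data the global threshold flags no animal, whereas the per-group threshold flags the animal with mass 19, which is exceptional only relative to its own group.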
A special feature of the summarize() command is that it “peels off” one level of the grouping: by default, the result is no longer grouped by the last grouping variable.
grp_dat <- dat |> group_by(species, island)
groups(grp_dat)
[[1]]
species
[[2]]
island
grp_dat |> summarize(n = n()) |> groups()
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
list()
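The .groups argument mentioned in the message controls which grouping the result keeps. A minimal sketch on made-up data (toy is hypothetical), again using group_vars() to list the remaining grouping variables:

```r
library(dplyr)

# Hypothetical data
toy <- tibble(
  species = c("A", "A", "B"),
  island  = c("X", "Y", "X"),
  mass    = c(1, 2, 3)
)

grouped <- toy |> group_by(species, island)

# Default: the last grouping variable (island) is peeled off
default <- grouped |> summarize(n = n())

# "drop": the result is not grouped at all
dropped <- grouped |> summarize(n = n(), .groups = "drop")

# "keep": the result keeps the full original grouping
kept <- grouped |> summarize(n = n(), .groups = "keep")

group_vars(default)
group_vars(dropped)
group_vars(kept)
```

Setting .groups explicitly also silences the message about the grouped output.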
We make use of this feature in the example below where we calculate the fraction of species per island by first counting animals per island and species (“deepest” summary) and then summing animals over the remaining group (island).
dat |>
  group_by(island, species) |>
  summarize(n = n()) |>
  mutate(frac = n / sum(n))
`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.
# A tibble: 5 × 4
# Groups: island [3]
island species n frac
<chr> <chr> <int> <dbl>
1 Biscoe Adelie 44 0.262
2 Biscoe Gentoo 124 0.738
3 Dream Adelie 56 0.452
4 Dream Chinstrap 68 0.548
5 Torgersen Adelie 52 1
Exercise
Calculate the fraction of males per species and island. As in the previous exercise, exclude all animals with undetermined sex.
dat |>
  drop_na(sex) |>
  group_by(species, island, sex) |>
  summarize(n = n()) |>
  mutate(total = sum(n), frac = n / total)
`summarise()` has grouped output by 'species', 'island'. You can override
using the `.groups` argument.
# A tibble: 10 × 6
# Groups: species, island [5]
species island sex n total frac
<chr> <chr> <chr> <int> <int> <dbl>
1 Adelie Biscoe female 22 44 0.5
2 Adelie Biscoe male 22 44 0.5
3 Adelie Dream female 27 55 0.491
4 Adelie Dream male 28 55 0.509
5 Adelie Torgersen female 24 47 0.511
6 Adelie Torgersen male 23 47 0.489
7 Chinstrap Dream female 34 68 0.5
8 Chinstrap Dream male 34 68 0.5
9 Gentoo Biscoe female 58 119 0.487
10 Gentoo Biscoe male 61 119 0.513