tidyverse: arrange, slice, group_by and summarize

Author

Florian Geier

Published

October 9, 2024

Introduction

In this lecture we continue to use the tidyverse for exploratory data analysis. We focus on ways of arranging data sets in a particular order and on calculating summary statistics for data subsets.

Exercise

Create a script (a regular .R script, an .Rmd R Markdown file, or a .qmd Quarto file) that will be used to save all the code that we will write today.

We first load tidyverse, which is a package bundle that will load several specialized packages for data analysis.

library(tidyverse)

── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

We use the Palmer Penguins data set from the previous lectures. Assuming the data file is located in the data sub-directory of your current working directory (data/penguins.csv), we can load it with:

dat <- readr::read_csv("data/penguins.csv")

Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Short recap

The previous lecture introduced several commands to subset and modify the data. We will briefly recap some of them with examples.

Example 1

What data types are in our data set?

str(dat)

spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : num [1:344] 3750 3800 3250 NA 3450 ...
 $ sex              : chr [1:344] "male" "female" "female" NA ...
 $ year             : num [1:344] 2007 2007 2007 2007 2007 ...
 - attr(*, "spec")=
  .. cols(
  ..   species = col_character(),
  ..   island = col_character(),
  ..   bill_length_mm = col_double(),
  ..   bill_depth_mm = col_double(),
  ..   flipper_length_mm = col_double(),
  ..   body_mass_g = col_double(),
  ..   sex = col_character(),
  ..   year = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Example 2

Over what period of time are penguins observed?

years <- select(dat, year)
distinct(years)

# A tibble: 3 × 1
   year
  <dbl>
1  2007
2  2008
3  2009

or more conveniently using the pipe-operator (|>):

dat |>
  select(year) |>
  distinct()

# A tibble: 3 × 1
   year
  <dbl>
1  2007
2  2008
3  2009

Example 3

Select the first 3 cases observed in 2007.

dat |>
  filter(year == 2007) |>
  slice(1:3)

# A tibble: 3 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           39.5          17.4               186        3800
3 Adelie  Torge…           40.3          18                 195        3250
# ℹ 2 more variables: sex <chr>, year <dbl>

Example 4

Which animals are “exceptionally” big? We call an animal exceptional if its body mass is greater than the average body mass plus two times the standard deviation:

dat |>
  drop_na(body_mass_g) |> 
  mutate(big = body_mass_g > mean(body_mass_g) + 2*sd(body_mass_g)) |>
  filter(big)

# A tibble: 9 × 9
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Gentoo  Biscoe           48.4          14.6               213        5850
2 Gentoo  Biscoe           49.3          15.7               217        5850
3 Gentoo  Biscoe           49.2          15.2               221        6300
4 Gentoo  Biscoe           59.6          17                 230        6050
5 Gentoo  Biscoe           51.1          16.3               220        6000
6 Gentoo  Biscoe           45.2          16.4               223        5950
7 Gentoo  Biscoe           49.8          15.9               229        5950
8 Gentoo  Biscoe           55.1          16                 230        5850
9 Gentoo  Biscoe           48.8          16.2               222        6000
# ℹ 3 more variables: sex <chr>, year <dbl>, big <lgl>

Example 5

How many animals of each species live on each island?

dat |>
  count(island, species)

# A tibble: 5 × 3
  island    species       n
  <chr>     <chr>     <int>
1 Biscoe    Adelie       44
2 Biscoe    Gentoo      124
3 Dream     Adelie       56
4 Dream     Chinstrap    68
5 Torgersen Adelie       52

The count() example illustrates a more general principle: apply summaries to data subsets defined by some grouping variable(s), in this case island and species. You will hear more about this common task below.

Order data sets with `arrange()`

The arrange() command orders the rows of our tibble according to some selected variable(s). Let's define a small data summary in order to highlight a few use cases for arrange().

sub_dat <- dat |>
  drop_na(sex) |>
  count(island, species, sex)
sub_dat

# A tibble: 10 × 4
   island    species   sex        n
   <chr>     <chr>     <chr>  <int>
 1 Biscoe    Adelie    female    22
 2 Biscoe    Adelie    male      22
 3 Biscoe    Gentoo    female    58
 4 Biscoe    Gentoo    male      61
 5 Dream     Adelie    female    27
 6 Dream     Adelie    male      28
 7 Dream     Chinstrap female    34
 8 Dream     Chinstrap male      34
 9 Torgersen Adelie    female    24
10 Torgersen Adelie    male      23

We can arrange the data by animal count (n) with:

sub_dat |>
  arrange(n)

# A tibble: 10 × 4
   island    species   sex        n
   <chr>     <chr>     <chr>  <int>
 1 Biscoe    Adelie    female    22
 2 Biscoe    Adelie    male      22
 3 Torgersen Adelie    male      23
 4 Torgersen Adelie    female    24
 5 Dream     Adelie    female    27
 6 Dream     Adelie    male      28
 7 Dream     Chinstrap female    34
 8 Dream     Chinstrap male      34
 9 Biscoe    Gentoo    female    58
10 Biscoe    Gentoo    male      61

By default numerical variables are sorted from smallest to largest value. Sorting in descending order is achieved by wrapping the sort variable with desc():

sub_dat |>
  arrange(desc(n))

# A tibble: 10 × 4
   island    species   sex        n
   <chr>     <chr>     <chr>  <int>
 1 Biscoe    Gentoo    male      61
 2 Biscoe    Gentoo    female    58
 3 Dream     Chinstrap female    34
 4 Dream     Chinstrap male      34
 5 Dream     Adelie    male      28
 6 Dream     Adelie    female    27
 7 Torgersen Adelie    female    24
 8 Torgersen Adelie    male      23
 9 Biscoe    Adelie    female    22
10 Biscoe    Adelie    male      22

It's possible to arrange the data by several variables at once:

sub_dat |>
  arrange(island, species, desc(n))

# A tibble: 10 × 4
   island    species   sex        n
   <chr>     <chr>     <chr>  <int>
 1 Biscoe    Adelie    female    22
 2 Biscoe    Adelie    male      22
 3 Biscoe    Gentoo    male      61
 4 Biscoe    Gentoo    female    58
 5 Dream     Adelie    male      28
 6 Dream     Adelie    female    27
 7 Dream     Chinstrap female    34
 8 Dream     Chinstrap male      34
 9 Torgersen Adelie    female    24
10 Torgersen Adelie    male      23

In the example, the data is first sorted alphabetically on island, within the same island on species, and within each island-species combination on count in descending order.

Sometimes we would like to arrange categorical data in a particular order. Categorical variables are best represented as factors in R, where the order of the categories (=levels) can be freely chosen. arrange() then sorts the data in the order of the levels instead of alphabetically.

fact_dat <- sub_dat |>
  mutate(species = factor(species, levels = c('Chinstrap','Adelie','Gentoo'))) 
str(fact_dat)

tibble [10 × 4] (S3: tbl_df/tbl/data.frame)
 $ island : chr [1:10] "Biscoe" "Biscoe" "Biscoe" "Biscoe" ...
 $ species: Factor w/ 3 levels "Chinstrap","Adelie",..: 2 2 3 3 2 2 1 1 2 2
 $ sex    : chr [1:10] "female" "male" "female" "male" ...
 $ n      : int [1:10] 22 22 58 61 27 28 34 34 24 23

fact_dat |>
  arrange(species)

# A tibble: 10 × 4
   island    species   sex        n
   <chr>     <fct>     <chr>  <int>
 1 Dream     Chinstrap female    34
 2 Dream     Chinstrap male      34
 3 Biscoe    Adelie    female    22
 4 Biscoe    Adelie    male      22
 5 Dream     Adelie    female    27
 6 Dream     Adelie    male      28
 7 Torgersen Adelie    female    24
 8 Torgersen Adelie    male      23
 9 Biscoe    Gentoo    female    58
10 Biscoe    Gentoo    male      61

Exercise

Arrange sub_dat by species and, within each species, by the following order of islands: 1. Dream, 2. Torgersen, 3. Biscoe.

sub_dat |>
  mutate(island = factor(island, levels = c('Dream', 'Torgersen', 'Biscoe'))) |>
  arrange(species, island)

# A tibble: 10 × 4
   island    species   sex        n
   <fct>     <chr>     <chr>  <int>
 1 Dream     Adelie    female    27
 2 Dream     Adelie    male      28
 3 Torgersen Adelie    female    24
 4 Torgersen Adelie    male      23
 5 Biscoe    Adelie    female    22
 6 Biscoe    Adelie    male      22
 7 Dream     Chinstrap female    34
 8 Dream     Chinstrap male      34
 9 Biscoe    Gentoo    female    58
10 Biscoe    Gentoo    male      61

Exercise

Use arrange() and slice() to get the male animal from Biscoe island with the largest flipper length.

dat |>
  arrange(island, desc(sex), desc(flipper_length_mm)) |>
  slice(1)

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Gentoo  Biscoe           54.3          15.7               231        5650
# ℹ 2 more variables: sex <chr>, year <dbl>

More on slicing data

The slice() command selects rows by index, but there are more variants of this command. First, slice_head() and slice_tail() let you select the first or last cases of the data, respectively. Here we select the first two and the last two:

dat |>
  slice_head(n = 2)

# A tibble: 2 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
2 Adelie  Torge…           39.5          17.4               186        3800
# ℹ 2 more variables: sex <chr>, year <dbl>

dat |> 
  slice_tail(n = 2)

# A tibble: 2 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Chinst… Dream            50.8          19                 210        4100
2 Chinst… Dream            50.2          18.7               198        3775
# ℹ 2 more variables: sex <chr>, year <dbl>

Two more slice variants aim at selecting rows with the smallest or largest value for a given variable. Here we select cases with the smallest and the largest bill length, respectively.

dat |>
  slice_min(bill_length_mm)

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Dream            32.1          15.5               188        3050
# ℹ 2 more variables: sex <chr>, year <dbl>

dat |>
  slice_max(bill_length_mm)

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Gentoo  Biscoe           59.6            17               230        6050
# ℹ 2 more variables: sex <chr>, year <dbl>

These versions of slice are similar to a combination of arrange() and slice_head() or slice_tail().

dat |>
  arrange(desc(bill_length_mm)) |>    
  slice_head()

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Gentoo  Biscoe           59.6            17               230        6050
# ℹ 2 more variables: sex <chr>, year <dbl>

However, be careful when the ordering variable contains missing values (NA): arrange() places them at the end, so slice_tail() returns a case with a missing value.

dat |>
  arrange(bill_length_mm) |>    
  slice_tail()

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Gentoo  Biscoe             NA            NA                NA          NA
# ℹ 2 more variables: sex <chr>, year <dbl>

Exercise

Explain why these two expressions return different results!

dat |>
  arrange(year) |>    
  slice_head()

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torge…           39.1          18.7               181        3750
# ℹ 2 more variables: sex <chr>, year <dbl>

dat |>
  slice_min(year)

# A tibble: 110 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 100 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

A: slice_min() and slice_max() may return more rows than requested in the presence of ties.
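If exactly one row is wanted even when there are ties, slice_min() and slice_max() accept the argument with_ties = FALSE. A minimal sketch on a hypothetical toy tibble (the data below is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data: the smallest year (2007) appears twice
toy <- tibble(id = 1:4, year = c(2008, 2007, 2007, 2009))

# Default behaviour: all tied rows are returned
tied <- toy |>
  slice_min(year)

# with_ties = FALSE returns exactly n rows, here one
single <- toy |>
  slice_min(year, n = 1, with_ties = FALSE)
```

Here `tied` has two rows and `single` has one. Which of the tied rows is kept depends on the row order of the input, so this option is best used when any one of the tied cases will do.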

Exercise

From the Adelie penguins select the case(s) with the smallest body mass. How many are there?

dat |>
  filter(species == "Adelie") |>
  slice_min(body_mass_g)

# A tibble: 2 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Biscoe           36.5          16.6               181        2850
2 Adelie  Biscoe           36.4          17.1               184        2850
# ℹ 2 more variables: sex <chr>, year <dbl>

Split-apply-combine data analysis

Often we would like to perform operations on subsets of the data, e.g., calculate the mean body mass per species. This can be done by splitting the data into subsets as defined by a grouping variable (species), calculating the summary per subset (mean body mass), and combining the results into a new data set.
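To make the three steps concrete, here is a minimal sketch that spells them out by hand with base R's split() and lapply() on a hypothetical toy tibble, and then compares the result to the grouped dplyr pipeline. The object names and the data are made up for illustration; in practice the dplyr version is the one to use.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("A", "A", "B", "B", "B"),
  body_mass_g = c(10, 20, 30, 40, 50)
)

# 1. Split: one small tibble per species
pieces <- split(toy, toy$species)

# 2. Apply: calculate the mean body mass within each piece
means <- lapply(pieces, function(piece) {
  tibble(species = piece$species[1], mean_mass = mean(piece$body_mass_g))
})

# 3. Combine: stack the per-species results into one tibble
manual <- bind_rows(means)

# The same analysis expressed with group_by() and summarize()
grouped <- toy |>
  group_by(species) |>
  summarize(mean_mass = mean(body_mass_g))
```

Both `manual` and `grouped` contain one row per species with the same mean values.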

Define data subsets with `group_by()`

At the base of this “split-apply-combine” strategy is the grouping of our data which works via the group_by() command:

dat |>
  group_by(species)

# A tibble: 344 × 8
# Groups:   species [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

group_by() doesn’t perform any computation; it just returns the data grouped into the subsets defined by the grouping variable, in our case species. We can also group our data by several variables, e.g. species and island:

dat |>
  group_by(species, island)

# A tibble: 344 × 8
# Groups:   species, island [5]
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

Note, there are 5 groups in total, i.e., the combination of island and species led to 5 distinct subsets. Combinations that don’t appear in the data such as the Gentoo penguins on Torgersen island are dropped by default.
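If the empty combinations should be kept, the grouping variables can be stored as factors and the .drop argument set to FALSE. The sketch below uses count(), which also accepts .drop, on a hypothetical toy tibble (names and data are assumptions for illustration). Character columns carry no information about unobserved categories, so this only has an effect for factors.

```r
library(dplyr)

# Hypothetical toy data: the combination B / Y is never observed
toy <- tibble(
  species = factor(c("A", "A", "B"), levels = c("A", "B")),
  island  = factor(c("X", "Y", "X"), levels = c("X", "Y"))
)

# Default: only observed combinations appear
observed <- toy |>
  count(species, island)

# .drop = FALSE keeps unobserved combinations of factor levels with n = 0
all_combos <- toy |>
  count(species, island, .drop = FALSE)
```

`observed` has three rows, while `all_combos` has four, one of them with a count of zero.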

Exercise

Group the data by species and island. What do the following functions tell you about the data grouping?

group_keys()
group_indices()

grp_dat <- dat |>
  group_by(species, island)

grp_dat |> group_keys()

# A tibble: 5 × 2
  species   island   
  <chr>     <chr>    
1 Adelie    Biscoe   
2 Adelie    Dream    
3 Adelie    Torgersen
4 Chinstrap Dream    
5 Gentoo    Biscoe

grp_dat |> group_indices()

  [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
 [36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
 [71] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1
[106] 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2
[141] 2 2 2 2 2 2 2 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[176] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[211] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
[246] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 4 4 4 4
[281] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
[316] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4

Exercise

What happens in the following case?

dat |>
  group_by(species) |>
  group_by(island)

# A tibble: 344 × 8
# Groups:   island [3]
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>

A: By default any existing grouping will be overwritten by a new group_by() command.
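To extend an existing grouping instead of replacing it, group_by() offers the argument .add = TRUE. A minimal sketch on a hypothetical toy tibble (the data is invented for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species = c("A", "A", "B"),
  island  = c("X", "Y", "X")
)

# Default: the second group_by() replaces the first grouping
replaced <- toy |>
  group_by(species) |>
  group_by(island)

# .add = TRUE appends island to the existing grouping by species
added <- toy |>
  group_by(species) |>
  group_by(island, .add = TRUE)

group_vars(replaced)
group_vars(added)
```

group_vars() reports only island for `replaced`, and both species and island for `added`.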

Operations on data subsets

The important difference between a grouped and an un-grouped tibble is that subsequent operations are applied to each group separately (where applicable) instead of to the whole data set. Compare the results of these two pipelines, first un-grouped, then grouped:

dat |>
  slice_min(body_mass_g)

# A tibble: 1 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Chinst… Dream            46.9          16.6               192        2700
# ℹ 2 more variables: sex <chr>, year <dbl>

dat |>
  group_by(species, island) |>
  slice_min(body_mass_g)

# A tibble: 6 × 8
# Groups:   species, island [5]
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Biscoe           36.5          16.6               181        2850
2 Adelie  Biscoe           36.4          17.1               184        2850
3 Adelie  Dream            33.1          16.1               178        2900
4 Adelie  Torge…           38.6          17                 188        2900
5 Chinst… Dream            46.9          16.6               192        2700
6 Gentoo  Biscoe           42.7          13.7               208        3950
# ℹ 2 more variables: sex <chr>, year <dbl>

Exercise

For each species and sex pull out the cases with the largest flipper length.

dat |>
  drop_na(sex) |>
  group_by(species, sex) |>
  slice_max(flipper_length_mm)

# A tibble: 6 × 8
# Groups:   species, sex [6]
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Dream            35.7          18                 202        3550
2 Adelie  Torge…           44.1          18                 210        4000
3 Chinst… Dream            43.5          18.1               202        3400
4 Chinst… Dream            49            19.6               212        4300
5 Gentoo  Biscoe           46.9          14.6               222        4875
6 Gentoo  Biscoe           54.3          15.7               231        5650
# ℹ 2 more variables: sex <chr>, year <dbl>

You can unset the grouping with ungroup() which can be useful if you want to perform un-grouped operations afterwards. In the following example we use the n() command, which gives you the current group size, to count the number of animals per island. We then un-group the data to calculate the total number of animals (=rows).

animal_per_island <- dat |>
  group_by(island) |>
  mutate(n_island = n()) |>
  ungroup() |> 
  mutate(n_total = n()) 

animal_per_island

# A tibble: 344 × 10
   species island    bill_length_mm bill_depth_mm flipper_length_mm
   <chr>   <chr>              <dbl>         <dbl>             <dbl>
 1 Adelie  Torgersen           39.1          18.7               181
 2 Adelie  Torgersen           39.5          17.4               186
 3 Adelie  Torgersen           40.3          18                 195
 4 Adelie  Torgersen           NA            NA                  NA
 5 Adelie  Torgersen           36.7          19.3               193
 6 Adelie  Torgersen           39.3          20.6               190
 7 Adelie  Torgersen           38.9          17.8               181
 8 Adelie  Torgersen           39.2          19.6               195
 9 Adelie  Torgersen           34.1          18.1               193
10 Adelie  Torgersen           42            20.2               190
# ℹ 334 more rows
# ℹ 5 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>,
#   n_island <int>, n_total <int>

Adding a selection of distinct values for animal counts per island returns a more meaningful summary:

animal_per_island |>
  distinct(island, n_island, n_total)

# A tibble: 3 × 3
  island    n_island n_total
  <chr>        <int>   <int>
1 Torgersen       52     344
2 Biscoe         168     344
3 Dream          124     344

Exercise

How do the results of this workflow differ from the previous one?

dat |>
  mutate(n_total = n()) |>
  group_by(island) |>
  mutate(n_island = n()) |>
  distinct(island, n_island, n_total)

# A tibble: 3 × 3
# Groups:   island [3]
  island    n_island n_total
  <chr>        <int>   <int>
1 Torgersen       52     344
2 Biscoe         168     344
3 Dream          124     344

A: This workflow returns the same data but in a grouped tibble!

Exercise

Calculate the fraction of animals per island.

animal_per_island |>
  mutate(frac = n_island/n_total) |>
  distinct(island, frac)

# A tibble: 3 × 2
  island     frac
  <chr>     <dbl>
1 Torgersen 0.151
2 Biscoe    0.488
3 Dream     0.360

Calculate data summaries with `summarize()`

The group_by() command is often used together with summarize(). The result is a tibble where each row contains the collapsed summary per group.

dat |>
  group_by(species, island) |>
  summarize(min_body_mass = min(body_mass_g))

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 5 × 3
# Groups:   species [3]
  species   island    min_body_mass
  <chr>     <chr>             <dbl>
1 Adelie    Biscoe             2850
2 Adelie    Dream              2900
3 Adelie    Torgersen            NA
4 Chinstrap Dream              2700
5 Gentoo    Biscoe               NA

Caution

Note that summarize() simplifies the output in several ways:

All variables un-related to the grouping and the summary are dropped.
summarize() returns a reduced data set with one row per group. This “modification” is in contrast to mutate(), which preserves all rows.
Finally, the innermost grouping variable (island in our case) is dropped from the grouping.
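The message printed by summarize() points to the .groups argument, which controls the grouping of the result. The following minimal sketch, on a hypothetical toy tibble, compares three of its documented values; setting .groups explicitly also silences the message.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  species     = c("A", "A", "B"),
  island      = c("X", "Y", "X"),
  body_mass_g = c(10, 20, 30)
)

grp_toy <- toy |>
  group_by(species, island)

# "drop_last": drop the innermost grouping variable
# (what summarize() did in the example above)
peeled <- grp_toy |>
  summarize(min_mass = min(body_mass_g), .groups = "drop_last")

# "drop": return an un-grouped tibble
dropped <- grp_toy |>
  summarize(min_mass = min(body_mass_g), .groups = "drop")

# "keep": keep the full grouping of the input
kept <- grp_toy |>
  summarize(min_mass = min(body_mass_g), .groups = "keep")

group_vars(peeled)
group_vars(dropped)
group_vars(kept)
```

Using .groups = "drop" is equivalent to adding ungroup() after summarize().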

Exercise

Why do we get NA values in the summary above? How can we omit them?

dat |>
  group_by(species, island) |>
  summarize(min_body_mass = min(body_mass_g, na.rm = TRUE))

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 5 × 3
# Groups:   species [3]
  species   island    min_body_mass
  <chr>     <chr>             <dbl>
1 Adelie    Biscoe             2850
2 Adelie    Dream              2900
3 Adelie    Torgersen          2900
4 Chinstrap Dream              2700
5 Gentoo    Biscoe             3950

dat |>
  drop_na(body_mass_g) |>
  group_by(species, island) |>
  summarize(min_body_mass = min(body_mass_g))

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 5 × 3
# Groups:   species [3]
  species   island    min_body_mass
  <chr>     <chr>             <dbl>
1 Adelie    Biscoe             2850
2 Adelie    Dream              2900
3 Adelie    Torgersen          2900
4 Chinstrap Dream              2700
5 Gentoo    Biscoe             3950

Now we can replicate the functionality of the count() command using group_by(), summarize() and n().

dat |>
  group_by(island, species) |>
  summarize(n = n()) |>
  ungroup()

`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.

# A tibble: 5 × 3
  island    species       n
  <chr>     <chr>     <int>
1 Biscoe    Adelie       44
2 Biscoe    Gentoo      124
3 Dream     Adelie       56
4 Dream     Chinstrap    68
5 Torgersen Adelie       52

dat |>
  count(island, species)

# A tibble: 5 × 3
  island    species       n
  <chr>     <chr>     <int>
1 Biscoe    Adelie       44
2 Biscoe    Gentoo      124
3 Dream     Adelie       56
4 Dream     Chinstrap    68
5 Torgersen Adelie       52

Exercise

Using the n() and sum() commands, calculate the average body mass per species. Compare your results with those obtained by using mean() directly.

dat |>
  group_by(species) |>
  summarize(avg = sum(body_mass_g)/n(),
            mean = mean(body_mass_g))

# A tibble: 3 × 3
  species     avg  mean
  <chr>     <dbl> <dbl>
1 Adelie      NA    NA 
2 Chinstrap 3733. 3733.
3 Gentoo      NA    NA

Exercise

In the exercise above, exclude cases with undetermined body mass and compare again with mean().

dat |>
  drop_na(body_mass_g) |>
  group_by(species) |>
  summarize(avg = sum(body_mass_g)/n(),
            mean = mean(body_mass_g))

# A tibble: 3 × 3
  species     avg  mean
  <chr>     <dbl> <dbl>
1 Adelie    3701. 3701.
2 Chinstrap 3733. 3733.
3 Gentoo    5076. 5076.

## wrong! n() also counts the rows with missing body mass, so the denominator is too large
dat |>
  group_by(species) |>
  summarize(avg = sum(body_mass_g, na.rm = TRUE)/n(),
            mean = mean(body_mass_g, na.rm = TRUE))

# A tibble: 3 × 3
  species     avg  mean
  <chr>     <dbl> <dbl>
1 Adelie    3676. 3701.
2 Chinstrap 3733. 3733.
3 Gentoo    5035. 5076.
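If the missing values should stay in the data, the denominator has to count only the non-missing observations. A minimal sketch on a hypothetical toy tibble (data and column names such as `avg_wrong` and `avg_ok` are made up for illustration), using sum(!is.na(...)) instead of n():

```r
library(dplyr)

# Hypothetical toy data: species A has one missing body mass
toy <- tibble(
  species     = c("A", "A", "A", "B", "B"),
  body_mass_g = c(10, 20, NA, 30, 50)
)

res <- toy |>
  group_by(species) |>
  summarize(
    # n() counts all rows, including the one with NA
    avg_wrong = sum(body_mass_g, na.rm = TRUE) / n(),
    # sum(!is.na(x)) counts only the rows that enter the sum
    avg_ok    = sum(body_mass_g, na.rm = TRUE) / sum(!is.na(body_mass_g)),
    mean      = mean(body_mass_g, na.rm = TRUE)
  )
res
```

For species A, `avg_wrong` divides the sum of 30 by three rows, whereas `avg_ok` and `mean` both divide by the two observed values.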

Exercise

Calculate the range (=difference between the maximal and minimal value) of body mass per species and sex. Again, we ignore all cases where sex is undetermined.

dat |>
  drop_na(sex) |>
  group_by(species, sex) |>
  summarize(min = min(body_mass_g),
            max = max(body_mass_g),
            range = max - min)

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

# A tibble: 6 × 5
# Groups:   species [3]
  species   sex      min   max range
  <chr>     <chr>  <dbl> <dbl> <dbl>
1 Adelie    female  2850  3900  1050
2 Adelie    male    3325  4775  1450
3 Chinstrap female  2700  4150  1450
4 Chinstrap male    3250  4800  1550
5 Gentoo    female  3950  5200  1250
6 Gentoo    male    4750  6300  1550

Exercise

Similar to the example in the beginning, we call an animal exceptionally big if its body mass is greater than the average body mass plus two times the standard deviation. However, this time the mean and standard deviation are calculated per species and sex. Use a combination of group_by(), mutate() and filter() to get the biggest animals. Also, exclude all animals with undetermined sex.

dat |>
  drop_na(sex) |>
  group_by(species, sex) |>
  mutate(big = body_mass_g > mean(body_mass_g) + 2*sd(body_mass_g)) |>
  filter(big)

# A tibble: 4 × 9
# Groups:   species, sex [4]
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>           <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Biscoe           43.2          19                 197        4775
2 Gentoo  Biscoe           49.2          15.2               221        6300
3 Chinst… Dream            46            18.9               195        4150
4 Chinst… Dream            52            20.7               210        4800
# ℹ 3 more variables: sex <chr>, year <dbl>, big <lgl>

As mentioned above, a special feature of the summarize() command is that it will “peel off” one level of the grouping.

grp_dat <- dat |>
  group_by(species, island) 

groups(grp_dat)

[[1]]
species

[[2]]
island

grp_dat |>
  summarize(n = n()) |>
  groups()

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

[[1]]
species

grp_dat |> 
  summarize(n = n()) |>
  summarize(n = n()) |>
  groups()

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

list()

We make use of this feature in the example below where we calculate the fraction of species per island by first counting animals per island and species (“deepest” summary) and then summing animals over the remaining group (island).

dat |>
  group_by(island, species) |>
  summarize(n = n()) |>
  mutate(frac = n/sum(n))

`summarise()` has grouped output by 'island'. You can override using the
`.groups` argument.

# A tibble: 5 × 4
# Groups:   island [3]
  island    species       n  frac
  <chr>     <chr>     <int> <dbl>
1 Biscoe    Adelie       44 0.262
2 Biscoe    Gentoo      124 0.738
3 Dream     Adelie       56 0.452
4 Dream     Chinstrap    68 0.548
5 Torgersen Adelie       52 1

Exercise

Calculate the fraction of males per species and island. As in the previous exercise, exclude all animals with undetermined sex.

dat |>
  drop_na(sex) |>
  group_by(species, island, sex) |>
  summarize(n = n())  |>
  mutate(total = sum(n),
         frac = n/total)

`summarise()` has grouped output by 'species', 'island'. You can override
using the `.groups` argument.

# A tibble: 10 × 6
# Groups:   species, island [5]
   species   island    sex        n total  frac
   <chr>     <chr>     <chr>  <int> <int> <dbl>
 1 Adelie    Biscoe    female    22    44 0.5  
 2 Adelie    Biscoe    male      22    44 0.5  
 3 Adelie    Dream     female    27    55 0.491
 4 Adelie    Dream     male      28    55 0.509
 5 Adelie    Torgersen female    24    47 0.511
 6 Adelie    Torgersen male      23    47 0.489
 7 Chinstrap Dream     female    34    68 0.5  
 8 Chinstrap Dream     male      34    68 0.5  
 9 Gentoo    Biscoe    female    58   119 0.487
10 Gentoo    Biscoe    male      61   119 0.513

More information

dplyr cheatsheet
Data transformation chapter of ‘R for data science’
dplyr data grouping
dplyr data summary

Session info

R version 4.4.1 Patched (2024-09-30 r87208)
Platform: x86_64-apple-darwin20
Running under: macOS Sonoma 14.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4     
 [5] purrr_1.0.2      readr_2.1.5      tidyr_1.3.1      tibble_3.2.1    
 [9] ggplot2_3.5.1    tidyverse_2.0.0  BiocStyle_2.32.1 png_0.1-8       
[13] knitr_1.48      

loaded via a namespace (and not attached):
 [1] bit_4.5.0           gtable_0.3.5        jsonlite_1.8.9     
 [4] crayon_1.5.3        compiler_4.4.1      BiocManager_1.30.25
 [7] tidyselect_1.2.1    parallel_4.4.1      scales_1.3.0       
[10] yaml_2.3.10         fastmap_1.2.0       R6_2.5.1           
[13] generics_0.1.3      munsell_0.5.1       tzdb_0.4.0         
[16] pillar_1.9.0        rlang_1.1.4         utf8_1.2.4         
[19] stringi_1.8.4       xfun_0.48           bit64_4.5.2        
[22] timechange_0.3.0    cli_3.6.3           withr_3.0.1        
[25] magrittr_2.0.3      digest_0.6.37       vroom_1.6.5        
[28] rstudioapi_0.16.0   hms_1.1.3           lifecycle_1.0.4    
[31] vctrs_0.6.5         evaluate_1.0.0      glue_1.8.0         
[34] fansi_1.0.6         colorspace_2.1-1    rmarkdown_2.28     
[37] tools_4.4.1         pkgconfig_2.0.3     htmltools_0.5.8.1
