tidyverse: joins, pivots, first ggplot2

Author

Michael Stadler

Published

October 16, 2024

Introduction

In the previous lecture we saw how to manipulate data.frames, for example how to select specific rows using slice, or how to summarize data using summarise or group_by.

In this lecture, we will see how to combine multiple data.frames using join-operations, and how to reformat your data into so called long or short format. With these tools we will be well equipped to start exploring our data using ggplot2 (teaser at the end and next session).

Load the tidyverse package

We start by loading the tidyverse package.

library(tidyverse)

── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load data

dat <- readr::read_csv("data/penguins.csv")

Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Joining tables

Quite frequently, we have to combine information from multiple tables into a single one, for example because the data was collected from different sources.

Join operations do this for us, and they do not require that the rows of the combined tables correspond to each other. Instead, the use a key column, which links rows across the tables.

A simple example

Here, we create two tables derived from dat for later joining:

# mean flipper length by species
avg <- dat |>
  group_by(species) |>
  summarize(mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE),
            n = n()) |>
  filter(n > 100)
avg

# A tibble: 2 × 3
  species mean_flipper_length_mm     n
  <chr>                    <dbl> <int>
1 Adelie                    190.   152
2 Gentoo                    217.   124

# random subset of `dat`
set.seed(42L)
datsub <- dat |>
  select(species, island, flipper_length_mm) |>
  slice_sample(n = 10)
datsub

# A tibble: 10 × 3
   species   island    flipper_length_mm
   <chr>     <chr>                 <dbl>
 1 Adelie    Dream                   190
 2 Chinstrap Dream                   196
 3 Gentoo    Biscoe                  211
 4 Adelie    Torgersen               197
 5 Gentoo    Biscoe                  230
 6 Adelie    Dream                   185
 7 Adelie    Torgersen               198
 8 Chinstrap Dream                   198
 9 Adelie    Torgersen               195
10 Chinstrap Dream                   200

The goal is now to add the mean flipper length from avg to the table datsub, using species to link the two:

full_join(datsub, avg)

Joining with `by = join_by(species)`

# A tibble: 10 × 5
   species   island    flipper_length_mm mean_flipper_length_mm     n
   <chr>     <chr>                 <dbl>                  <dbl> <int>
 1 Adelie    Dream                   190                   190.   152
 2 Chinstrap Dream                   196                    NA     NA
 3 Gentoo    Biscoe                  211                   217.   124
 4 Adelie    Torgersen               197                   190.   152
 5 Gentoo    Biscoe                  230                   217.   124
 6 Adelie    Dream                   185                   190.   152
 7 Adelie    Torgersen               198                   190.   152
 8 Chinstrap Dream                   198                    NA     NA
 9 Adelie    Torgersen               195                   190.   152
10 Chinstrap Dream                   200                    NA     NA

Controlling how to link rows

In this exapmle, we did not specify how to link corresponding rows in datsub and avg. By default, full_join(..., by = NULL) will select all columns in common across the two tables and will print a message so we can check if it correctly identified the joining variable(s).

You could use by = join_by(...) to explicitly control our joining:

full_join(datsub, avg, by = join_by(species))

# A tibble: 10 × 5
   species   island    flipper_length_mm mean_flipper_length_mm     n
   <chr>     <chr>                 <dbl>                  <dbl> <int>
 1 Adelie    Dream                   190                   190.   152
 2 Chinstrap Dream                   196                    NA     NA
 3 Gentoo    Biscoe                  211                   217.   124
 4 Adelie    Torgersen               197                   190.   152
 5 Gentoo    Biscoe                  230                   217.   124
 6 Adelie    Dream                   185                   190.   152
 7 Adelie    Torgersen               198                   190.   152
 8 Chinstrap Dream                   198                    NA     NA
 9 Adelie    Torgersen               195                   190.   152
10 Chinstrap Dream                   200                    NA     NA

Different types of join operations

When calculating avg above, we excluded species with less than 100 observations. As a side-effect, avg doesn’t contain the mean flipper length for the species Chinstrap.

The full_join has inserted NA values for such missing data. There are alternative join-functions that behave differently, for example:

inner_join(datsub, avg)

Joining with `by = join_by(species)`

# A tibble: 7 × 5
  species island    flipper_length_mm mean_flipper_length_mm     n
  <chr>   <chr>                 <dbl>                  <dbl> <int>
1 Adelie  Dream                   190                   190.   152
2 Gentoo  Biscoe                  211                   217.   124
3 Adelie  Torgersen               197                   190.   152
4 Gentoo  Biscoe                  230                   217.   124
5 Adelie  Dream                   185                   190.   152
6 Adelie  Torgersen               198                   190.   152
7 Adelie  Torgersen               195                   190.   152

In general, we discriminate four different types of joins:

inner_join is the simplest, which keeps only rows that occur in both tables.

To avoid loosing any observations, we have alternative outer joins that keep rows even if they cannot be linked across tables, filling in the missing values as NAs.

left_join keeps all rows in the first (left) table x
right_join keeps all rows in the second (right) table y
full_join keeps all rows from both tables x and y

This is only a short overview of joining operations. To learn more details, you can look at the Joins section from the open R for Data Science book, from which the above schemes and the venn diagram below have been taken.

In summary, you can think of the join operations also as set operations, that combine the observations (rows) from two sets (tables) and keep a defined subset:

Exercises

Instead of using inner_join(), produce the same result when joining datsub and avg by using full_join() and filter().

# inner join
res1 <- inner_join(datsub, avg)

Joining with `by = join_by(species)`

# full_join and filter
res2 <- full_join(datsub, avg) |>
  filter(!is.na(mean_flipper_length_mm))

Joining with `by = join_by(species)`

# compare results
identical(res1, res2)

[1] TRUE

Use a join and mutate to calculate the difference to the mean flipper length in avg for the observations in datsub.

full_join(datsub, avg, by = join_by(species)) |>
  mutate(diff_to_mean = flipper_length_mm - mean_flipper_length_mm) |>
  select(species, island, diff_to_mean)

# A tibble: 10 × 3
   species   island    diff_to_mean
   <chr>     <chr>            <dbl>
 1 Adelie    Dream           0.0464
 2 Chinstrap Dream          NA     
 3 Gentoo    Biscoe         -6.19  
 4 Adelie    Torgersen       7.05  
 5 Gentoo    Biscoe         12.8   
 6 Adelie    Dream          -4.95  
 7 Adelie    Torgersen       8.05  
 8 Chinstrap Dream          NA     
 9 Adelie    Torgersen       5.05  
10 Chinstrap Dream          NA

Create a new data.frame named avg2 with the mean flipper length by species and by island. Note that Adelie penguins are found on several islands and slightly differ in size.

avg2 <- dat |>
  group_by(species, island) |>
  summarize(mean_flipper_length_mm = mean(flipper_length_mm, na.rm = TRUE)) |>
  ungroup()

`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.

avg2

# A tibble: 5 × 3
  species   island    mean_flipper_length_mm
  <chr>     <chr>                      <dbl>
1 Adelie    Biscoe                      189.
2 Adelie    Dream                       190.
3 Adelie    Torgersen                   191.
4 Chinstrap Dream                       196.
5 Gentoo    Biscoe                      217.

Join datsub to avg2, using both species and island to link the two tables (tip: to control the joining, you can use by = join_by(...)).

full_join(datsub, avg2, by = join_by(species, island))

# A tibble: 11 × 4
   species   island    flipper_length_mm mean_flipper_length_mm
   <chr>     <chr>                 <dbl>                  <dbl>
 1 Adelie    Dream                   190                   190.
 2 Chinstrap Dream                   196                   196.
 3 Gentoo    Biscoe                  211                   217.
 4 Adelie    Torgersen               197                   191.
 5 Gentoo    Biscoe                  230                   217.
 6 Adelie    Dream                   185                   190.
 7 Adelie    Torgersen               198                   191.
 8 Chinstrap Dream                   198                   196.
 9 Adelie    Torgersen               195                   191.
10 Chinstrap Dream                   200                   196.
11 Adelie    Biscoe                   NA                   189.

Long- and short-format tables

There are several ways how a given set of data can be represented in a data.frame. Our dat is an example of the long-format, which places all values of a given type (for example body_mass_g) into a single column, and additional columns (e.g. species or year) are used to provide the context for each value:

dat |>
  select(species, year, body_mass_g)

# A tibble: 344 × 3
   species  year body_mass_g
   <chr>   <dbl>       <dbl>
 1 Adelie   2007        3750
 2 Adelie   2007        3800
 3 Adelie   2007        3250
 4 Adelie   2007          NA
 5 Adelie   2007        3450
 6 Adelie   2007        3650
 7 Adelie   2007        3625
 8 Adelie   2007        4675
 9 Adelie   2007        3475
10 Adelie   2007        4250
# ℹ 334 more rows

The long-format may not always be easy to read, and the wide-format may be a more compact representation of the data.

Some data is typically represented in the wide-format, such as expression data that is usually kept in a gene-by-sample matrix. In contrast to the long-format, the wide-format has multiple columns containing the same type of values (such as expression levels), and the row and column in which a value is found provide the context (the identity of the gene and the sample). This implies that usually there is only a single value for each combination of a row and column, making the wide-format less flexible than the long-format which does not have this restriction, unless a something more complex than a data.frame is used for data representation (we will see an example of that in the exercises).

Long-format and short-format representation are illustrated in the figure below, taken from Introduction to data analysis with R and Bioconductor:

We can use pivot_wider and pivot_longer functions to reshape data from one to the other representation.

As mentioned, dat already is in the long-format. We can select a subset of it and store it in dat_long for illustration:

# randomly select one observation of `body_mass_g`
# for each combination of `species` and `year`
set.seed(43L)
dat_long <- dat |>
  filter(!is.na(body_mass_g)) |>
  select(species, year, body_mass_g) |>
  group_by(species, year) |>
  slice_sample(n = 1) |>
  ungroup()
dat_long

# A tibble: 9 × 3
  species    year body_mass_g
  <chr>     <dbl>       <dbl>
1 Adelie     2007        3000
2 Adelie     2008        3600
3 Adelie     2009        4250
4 Chinstrap  2007        3600
5 Chinstrap  2008        3800
6 Chinstrap  2009        3250
7 Gentoo     2007        5100
8 Gentoo     2008        5000
9 Gentoo     2009        4750

The pivot_wider function can take a long-format data.frame and reshape it into wide-format. For that, we need to specify which column contains the names (names_from) that will become the column names, and which column contains the values (values_from) that will provide the data for the wide-format table:

dat_wide <- dat_long |>
  pivot_wider(names_from = year, values_from = body_mass_g)
dat_wide

# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <chr>      <dbl>  <dbl>  <dbl>
1 Adelie      3000   3600   4250
2 Chinstrap   3600   3800   3250
3 Gentoo      5100   5000   4750

Uniquely identifying observations

The species column was automatically carried over, and it serves to uniquely identify each value in combination with the names_from column(s). It is set using the id_cols argument, which default to all columns expect the ones used in names_from and values_from.

The pivot_longer function can be used to convert a table back from wide-format to long-format. We use the cols argument to specify which columns should be reshaped (here, -species means all except the species column). In addition, we can specify the names for the new columns that will take the names (names_to) and values (values_to), to use alternative to the default “name” and “value”:

# default "name" and "value" columns
dat_wide |> pivot_longer(cols = -species)

# A tibble: 9 × 3
  species   name  value
  <chr>     <chr> <dbl>
1 Adelie    2007   3000
2 Adelie    2008   3600
3 Adelie    2009   4250
4 Chinstrap 2007   3600
5 Chinstrap 2008   3800
6 Chinstrap 2009   3250
7 Gentoo    2007   5100
8 Gentoo    2008   5000
9 Gentoo    2009   4750

# sepecify name and value column names
dat_wide |> pivot_longer(cols = -species,
                         names_to = "year",
                         values_to = "body_mass_g")

# A tibble: 9 × 3
  species   year  body_mass_g
  <chr>     <chr>       <dbl>
1 Adelie    2007         3000
2 Adelie    2008         3600
3 Adelie    2009         4250
4 Chinstrap 2007         3600
5 Chinstrap 2008         3800
6 Chinstrap 2009         3250
7 Gentoo    2007         5100
8 Gentoo    2008         5000
9 Gentoo    2009         4750

You can find a more detailed discussion of pivoting in this vignette from the tidyr package.

Exercises

Starting from dat_long, filter out the row for species == "Chinstrap" and year == 2008, then covert it to wide-format with year in columns. Observe the values in the 2008 column. What happens to the missing values when you convert the result back to long-format?

datex <- dat_long |>
  filter(species != "Chinstrap" | year != 2008) |>
  pivot_wider(names_from = year, values_from = body_mass_g)
datex

# A tibble: 3 × 4
  species   `2007` `2008` `2009`
  <chr>      <dbl>  <dbl>  <dbl>
1 Adelie      3000   3600   4250
2 Chinstrap   3600     NA   3250
3 Gentoo      5100   5000   4750

datex |> pivot_longer(cols = -species)

# A tibble: 9 × 3
  species   name  value
  <chr>     <chr> <dbl>
1 Adelie    2007   3000
2 Adelie    2008   3600
3 Adelie    2009   4250
4 Chinstrap 2007   3600
5 Chinstrap 2008     NA
6 Chinstrap 2009   3250
7 Gentoo    2007   5100
8 Gentoo    2008   5000
9 Gentoo    2009   4750

Starting from dat, select columns species, sex and flipper_length_mm, and then use pivot_wider to create a wide-format table where flipper length values are organized into separate columns by sex. How many columns do you expect in the output? How are the values represented?

dat |>
  select(species, sex, flipper_length_mm) |>
  pivot_wider(names_from = sex, values_from = flipper_length_mm)

Warning: Values from `flipper_length_mm` are not uniquely identified; output will
contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} |>
  dplyr::summarise(n = dplyr::n(), .by = c(species, sex)) |>
  dplyr::filter(n > 1L)

# A tibble: 3 × 4
  species   male       female     `NA`     
  <chr>     <list>     <list>     <list>   
1 Adelie    <dbl [73]> <dbl [73]> <dbl [6]>
2 Gentoo    <dbl [61]> <dbl [58]> <dbl [5]>
3 Chinstrap <dbl [34]> <dbl [34]> <NULL>

Starting from dat, select columns species, sex and flipper_length_mm, add a column id with values one to n that assigns a unique number to each observation, and then use pivot_wider to create a wide-format table where flipper length values are organized into separate columns by sex. How are the values represented (compare to the result from the previous exercise)?

dat |>
  select(species, sex, flipper_length_mm) |>
  mutate(id = seq_along(species)) |>
  pivot_wider(names_from = sex, values_from = flipper_length_mm)

# A tibble: 344 × 5
   species    id  male female  `NA`
   <chr>   <int> <dbl>  <dbl> <dbl>
 1 Adelie      1   181     NA    NA
 2 Adelie      2    NA    186    NA
 3 Adelie      3    NA    195    NA
 4 Adelie      4    NA     NA    NA
 5 Adelie      5    NA    193    NA
 6 Adelie      6   190     NA    NA
 7 Adelie      7    NA    181    NA
 8 Adelie      8   195     NA    NA
 9 Adelie      9    NA     NA   193
10 Adelie     10    NA     NA   190
# ℹ 334 more rows

Reshaping data for plotting

Data visualization will be the focus of the session dedicated to ggplot2, but one aspect of plotting data is worth mentioning here that relates to the format used to represent the data:

In order to plot data with ggplot2, the data is best organized in long-format. This simplifies assigning column to the scale(s) of a plot, such as x-axis, y-axis or color scales.

Here is an example using dat_long (don’t worry about the details of ggplot2 - it will all be discussed in detail later): We want to plot the body mass penguins (y-axis) as a function of year (x-axis) and species (color). aes() maps columns to axes (e.g. x = year), and theme_light(24) chooses a light plot style (e.g. white background) with a base font size of 24.

library(ggplot2)

ggplot(dat_long, aes(x = year, y = body_mass_g, color = species)) +
  geom_point() +
  theme_light(24)

Exercises

Modify the above ggplot such that it plots body_mass_g on the x-axis, and year on the y-axis.

ggplot(dat_long, aes(x = body_mass_g, y = year)) +
  geom_point() +
  theme_light(24)

Modify the above ggplot such that year also used as the color. What do you observed for the year-color scheme (compared to the species- colors above)? How can you use categorical instead of continuous colors for year? (Hint: What is R’s data type for categories?)

ggplot(dat_long, aes(x = body_mass_g, y = year, color = factor(year))) +
  geom_point() +
  theme_light(24)

Some more plotting examples

Let’s use all the data in dat now. As mentioned, dat is in long-format and we can directly use it for plotting:

ggplot(dat, aes(x = year, y = body_mass_g, color = species)) +
  geom_point() +
  theme_light(24)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

The many values collected in a single year lead to some over-plotting. We can make the plot a bit more informative, by adding some jittering to the x values to reduce the number of overlapping points. Again, don’t worry if you see the plotting functions for the first time; they will all be discussed later and just serve here to illustrate that you can create a sophisticated plot with little effort, as long as your data is organized in long-format:

ggplot(dat, aes(x = year, y = body_mass_g, color = species)) +
  geom_jitter(width = 0.2, height = 0) +
  labs(x = "Year", y = "Body mass (g)") +
  scale_x_continuous(breaks = unique(dat$year)) +
  theme_light(24)

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

An alternative that summarizes the observations is for example created using geom_violin:

ggplot(dat, aes(x = factor(year), y = body_mass_g, fill = species)) +
  geom_violin(draw_quantiles = 0.5) +
  labs(x = "Year", y = "Body mass (g)") +
  theme_light(24)

Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).

More information

Session info

R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Zurich
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] lubridate_1.9.3  forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4     
 [5] purrr_1.0.2      readr_2.1.5      tidyr_1.3.1      tibble_3.2.1    
 [9] ggplot2_3.5.1    tidyverse_2.0.0  BiocStyle_2.32.1 png_0.1-8       
[13] knitr_1.48      

loaded via a namespace (and not attached):
 [1] utf8_1.2.4          generics_0.1.3      stringi_1.8.4      
 [4] hms_1.1.3           digest_0.6.37       magrittr_2.0.3     
 [7] evaluate_1.0.1      timechange_0.3.0    fastmap_1.2.0      
[10] jsonlite_1.8.9      BiocManager_1.30.25 fansi_1.0.6        
[13] scales_1.3.0        cli_3.6.3           rlang_1.1.4        
[16] crayon_1.5.3        bit64_4.5.2         munsell_0.5.1      
[19] withr_3.0.1         yaml_2.3.10         tools_4.4.1        
[22] parallel_4.4.1      tzdb_0.4.0          colorspace_2.1-1   
[25] vctrs_0.6.5         R6_2.5.1            lifecycle_1.0.4    
[28] htmlwidgets_1.6.4   bit_4.5.0           vroom_1.6.5        
[31] pkgconfig_2.0.3     pillar_1.9.0        gtable_0.3.5       
[34] glue_1.8.0          xfun_0.48           tidyselect_1.2.1   
[37] rstudioapi_0.16.0   farver_2.1.2        htmltools_0.5.8.1  
[40] rmarkdown_2.28      labeling_0.4.3      compiler_4.4.1