In the previous lecture we saw how to manipulate data.frames, for example how to select specific rows using slice, or how to summarize data using summarise or group_by.
In this lecture, we will see how to combine multiple data.frames using join-operations, and how to reformat your data into so called long or short format. With these tools we will be well equipped to start exploring our data using ggplot2 (teaser at the end and next session).
Load the tidyverse package
We start by loading the tidyverse package.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load data
dat <- readr::read_csv("data/penguins.csv")
Rows: 344 Columns: 8
── Column specification ───────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Joining tables
Quite frequently, we have to combine information from multiple tables into a single one, for example because the data was collected from different sources.
Join operations do this for us, and they do not require that the rows of the combined tables correspond to each other. Instead, the use a key column, which links rows across the tables.
A simple example
Here, we create two tables derived from dat for later joining:
# mean flipper length by speciesavg <- dat |>group_by(species) |>summarize(mean_flipper_length_mm =mean(flipper_length_mm, na.rm =TRUE),n =n()) |>filter(n >100)avg
# A tibble: 2 × 3
species mean_flipper_length_mm n
<chr> <dbl> <int>
1 Adelie 190. 152
2 Gentoo 217. 124
# random subset of `dat`set.seed(42L)datsub <- dat |>select(species, island, flipper_length_mm) |>slice_sample(n =10)datsub
The goal is now to add the mean flipper length from avg to the table datsub, using species to link the two:
full_join(datsub, avg)
Joining with `by = join_by(species)`
# A tibble: 10 × 5
species island flipper_length_mm mean_flipper_length_mm n
<chr> <chr> <dbl> <dbl> <int>
1 Adelie Dream 190 190. 152
2 Chinstrap Dream 196 NA NA
3 Gentoo Biscoe 211 217. 124
4 Adelie Torgersen 197 190. 152
5 Gentoo Biscoe 230 217. 124
6 Adelie Dream 185 190. 152
7 Adelie Torgersen 198 190. 152
8 Chinstrap Dream 198 NA NA
9 Adelie Torgersen 195 190. 152
10 Chinstrap Dream 200 NA NA
Controlling how to link rows
In this exapmle, we did not specify how to link corresponding rows in datsub and avg. By default, full_join(..., by = NULL) will select all columns in common across the two tables and will print a message so we can check if it correctly identified the joining variable(s).
You could use by = join_by(...) to explicitly control our joining:
full_join(datsub, avg, by =join_by(species))
# A tibble: 10 × 5
species island flipper_length_mm mean_flipper_length_mm n
<chr> <chr> <dbl> <dbl> <int>
1 Adelie Dream 190 190. 152
2 Chinstrap Dream 196 NA NA
3 Gentoo Biscoe 211 217. 124
4 Adelie Torgersen 197 190. 152
5 Gentoo Biscoe 230 217. 124
6 Adelie Dream 185 190. 152
7 Adelie Torgersen 198 190. 152
8 Chinstrap Dream 198 NA NA
9 Adelie Torgersen 195 190. 152
10 Chinstrap Dream 200 NA NA
Different types of join operations
When calculating avg above, we excluded species with less than 100 observations. As a side-effect, avg doesn’t contain the mean flipper length for the species Chinstrap.
The full_join has inserted NA values for such missing data. There are alternative join-functions that behave differently, for example:
In general, we discriminate four different types of joins:
inner_join is the simplest, which keeps only rows that occur in both tables.
To avoid loosing any observations, we have alternative outer joins that keep rows even if they cannot be linked across tables, filling in the missing values as NAs.
left_join keeps all rows in the first (left) table x
right_join keeps all rows in the second (right) table y
full_join keeps all rows from both tables x and y
This is only a short overview of joining operations. To learn more details, you can look at the Joins section from the open R for Data Science book, from which the above schemes and the venn diagram below have been taken.
In summary, you can think of the join operations also as set operations, that combine the observations (rows) from two sets (tables) and keep a defined subset:
Exercises
Instead of using inner_join(), produce the same result when joining datsub and avg by using full_join() and filter().
# inner joinres1 <-inner_join(datsub, avg)
Joining with `by = join_by(species)`
# full_join and filterres2 <-full_join(datsub, avg) |>filter(!is.na(mean_flipper_length_mm))
Joining with `by = join_by(species)`
# compare resultsidentical(res1, res2)
[1] TRUE
Use a join and mutate to calculate the difference to the mean flipper length in avg for the observations in datsub.
full_join(datsub, avg, by =join_by(species)) |>mutate(diff_to_mean = flipper_length_mm - mean_flipper_length_mm) |>select(species, island, diff_to_mean)
# A tibble: 10 × 3
species island diff_to_mean
<chr> <chr> <dbl>
1 Adelie Dream 0.0464
2 Chinstrap Dream NA
3 Gentoo Biscoe -6.19
4 Adelie Torgersen 7.05
5 Gentoo Biscoe 12.8
6 Adelie Dream -4.95
7 Adelie Torgersen 8.05
8 Chinstrap Dream NA
9 Adelie Torgersen 5.05
10 Chinstrap Dream NA
Create a new data.frame named avg2 with the mean flipper length by species and by island. Note that Adelie penguins are found on several islands and slightly differ in size.
avg2 <- dat |>group_by(species, island) |>summarize(mean_flipper_length_mm =mean(flipper_length_mm, na.rm =TRUE)) |>ungroup()
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
There are several ways how a given set of data can be represented in a data.frame. Our dat is an example of the long-format, which places all values of a given type (for example body_mass_g) into a single column, and additional columns (e.g. species or year) are used to provide the context for each value:
The long-format may not always be easy to read, and the wide-format may be a more compact representation of the data.
Some data is typically represented in the wide-format, such as expression data that is usually kept in a gene-by-sample matrix. In contrast to the long-format, the wide-format has multiple columns containing the same type of values (such as expression levels), and the row and column in which a value is found provide the context (the identity of the gene and the sample). This implies that usually there is only a single value for each combination of a row and column, making the wide-format less flexible than the long-format which does not have this restriction, unless a something more complex than a data.frame is used for data representation (we will see an example of that in the exercises).
We can use pivot_wider and pivot_longer functions to reshape data from one to the other representation.
As mentioned, dat already is in the long-format. We can select a subset of it and store it in dat_long for illustration:
# randomly select one observation of `body_mass_g`# for each combination of `species` and `year`set.seed(43L)dat_long <- dat |>filter(!is.na(body_mass_g)) |>select(species, year, body_mass_g) |>group_by(species, year) |>slice_sample(n =1) |>ungroup()dat_long
The pivot_wider function can take a long-formatdata.frame and reshape it into wide-format. For that, we need to specify which column contains the names (names_from) that will become the column names, and which column contains the values (values_from) that will provide the data for the wide-format table:
The species column was automatically carried over, and it serves to uniquely identify each value in combination with the names_from column(s). It is set using the id_cols argument, which default to all columns expect the ones used in names_from and values_from.
The pivot_longer function can be used to convert a table back from wide-format to long-format. We use the cols argument to specify which columns should be reshaped (here, -species means all except the species column). In addition, we can specify the names for the new columns that will take the names (names_to) and values (values_to), to use alternative to the default “name” and “value”:
# default "name" and "value" columnsdat_wide |>pivot_longer(cols =-species)
You can find a more detailed discussion of pivoting in this vignette from the tidyr package.
Exercises
Starting from dat_long, filter out the row for species == "Chinstrap" and year == 2008, then covert it to wide-format with year in columns. Observe the values in the 2008 column. What happens to the missing values when you convert the result back to long-format?
# A tibble: 3 × 4
species `2007` `2008` `2009`
<chr> <dbl> <dbl> <dbl>
1 Adelie 3000 3600 4250
2 Chinstrap 3600 NA 3250
3 Gentoo 5100 5000 4750
datex |>pivot_longer(cols =-species)
# A tibble: 9 × 3
species name value
<chr> <chr> <dbl>
1 Adelie 2007 3000
2 Adelie 2008 3600
3 Adelie 2009 4250
4 Chinstrap 2007 3600
5 Chinstrap 2008 NA
6 Chinstrap 2009 3250
7 Gentoo 2007 5100
8 Gentoo 2008 5000
9 Gentoo 2009 4750
Starting from dat, select columns species, sex and flipper_length_mm, and then use pivot_wider to create a wide-format table where flipper length values are organized into separate columns by sex. How many columns do you expect in the output? How are the values represented?
dat |>select(species, sex, flipper_length_mm) |>pivot_wider(names_from = sex, values_from = flipper_length_mm)
Warning: Values from `flipper_length_mm` are not uniquely identified; output will
contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
{data} |>
dplyr::summarise(n = dplyr::n(), .by = c(species, sex)) |>
dplyr::filter(n > 1L)
Starting from dat, select columns species, sex and flipper_length_mm, add a column id with values one to n that assigns a unique number to each observation, and then use pivot_wider to create a wide-format table where flipper length values are organized into separate columns by sex. How are the values represented (compare to the result from the previous exercise)?
# A tibble: 344 × 5
species id male female `NA`
<chr> <int> <dbl> <dbl> <dbl>
1 Adelie 1 181 NA NA
2 Adelie 2 NA 186 NA
3 Adelie 3 NA 195 NA
4 Adelie 4 NA NA NA
5 Adelie 5 NA 193 NA
6 Adelie 6 190 NA NA
7 Adelie 7 NA 181 NA
8 Adelie 8 195 NA NA
9 Adelie 9 NA NA 193
10 Adelie 10 NA NA 190
# ℹ 334 more rows
Reshaping data for plotting
Data visualization will be the focus of the session dedicated to ggplot2, but one aspect of plotting data is worth mentioning here that relates to the format used to represent the data:
In order to plot data with ggplot2, the data is best organized in long-format. This simplifies assigning column to the scale(s) of a plot, such as x-axis, y-axis or color scales.
Here is an example using dat_long (don’t worry about the details of ggplot2 - it will all be discussed in detail later): We want to plot the body mass penguins (y-axis) as a function of year (x-axis) and species (color). aes() maps columns to axes (e.g. x = year), and theme_light(24) chooses a light plot style (e.g. white background) with a base font size of 24.
library(ggplot2)ggplot(dat_long, aes(x = year, y = body_mass_g, color = species)) +geom_point() +theme_light(24)
Exercises
Modify the above ggplot such that it plots body_mass_g on the x-axis, and year on the y-axis.
ggplot(dat_long, aes(x = body_mass_g, y = year)) +geom_point() +theme_light(24)
Modify the above ggplot such that year also used as the color. What do you observed for the year-color scheme (compared to the species- colors above)? How can you use categorical instead of continuous colors for year? (Hint: What is R’s data type for categories?)
ggplot(dat_long, aes(x = body_mass_g, y = year, color =factor(year))) +geom_point() +theme_light(24)
Some more plotting examples
Let’s use all the data in dat now. As mentioned, dat is in long-format and we can directly use it for plotting:
ggplot(dat, aes(x = year, y = body_mass_g, color = species)) +geom_point() +theme_light(24)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
The many values collected in a single year lead to some over-plotting. We can make the plot a bit more informative, by adding some jittering to the x values to reduce the number of overlapping points. Again, don’t worry if you see the plotting functions for the first time; they will all be discussed later and just serve here to illustrate that you can create a sophisticated plot with little effort, as long as your data is organized in long-format:
ggplot(dat, aes(x = year, y = body_mass_g, color = species)) +geom_jitter(width =0.2, height =0) +labs(x ="Year", y ="Body mass (g)") +scale_x_continuous(breaks =unique(dat$year)) +theme_light(24)
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
An alternative that summarizes the observations is for example created using geom_violin:
ggplot(dat, aes(x =factor(year), y = body_mass_g, fill = species)) +geom_violin(draw_quantiles =0.5) +labs(x ="Year", y ="Body mass (g)") +theme_light(24)
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_ydensity()`).