GROUP BY groups rows by column values

Computing summary statistics across all countries and years at once is a bit silly, because it lumps very different rows together. What if we wanted to see these metrics grouped by country, or by year?

This is, of course, a job for group_by(). For example, let's work out the minimum, average, and maximum life expectancy in each year of the data:

(
  reformatted
  %>% filter( !is.na( life_expectancy ))
  %>% group_by( year ) # <- this is the only new operation!
  %>% summarise(
    min_life_expectancy = min( life_expectancy ),
    average_life_expectancy = mean( life_expectancy ),
    max_life_expectancy = max( life_expectancy ),
    number_of_rows = n()
  )
)

Have a look at the result. You should see that there is now one row per year (but no separate countries).

Cool! Let's plot it:

(
  ggplot(
    data = (
      reformatted
      %>% filter( !is.na( life_expectancy ))
      %>% group_by( year )
      %>% summarise(
        min_life_expectancy = min( life_expectancy ),
        average_life_expectancy = mean( life_expectancy ),
        max_life_expectancy = max( life_expectancy ),
        number_of_rows = n()
      )
    )
  )
  + geom_ribbon(
    aes( x = year, y = average_life_expectancy, ymin = min_life_expectancy, ymax = max_life_expectancy ),
    fill = 'grey80',
    colour = 'black'
  )
  + geom_line( aes( x = year, y = average_life_expectancy ), linewidth = 2)
  + theme_minimal()
  + ylab( "Life expectancy\n(min/mean/max)" )
)
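
If it is not obvious what group_by() followed by summarise() is doing, it can help to try it on a table small enough to check by hand. The sketch below uses a made-up toy table (the country names and values are invented for illustration and are not from the real data), but the pipeline is the same as the one above:

```r
library(dplyr)

# Toy data, invented for illustration only: three countries over two years,
# with one missing life expectancy value.
toy <- tibble(
  country = c( "A", "B", "C", "A", "B", "C" ),
  year = c( 2000, 2000, 2000, 2001, 2001, 2001 ),
  life_expectancy = c( 60, 70, NA, 62, 72, 82 )
)

toy_summary <- (
  toy
  %>% filter( !is.na( life_expectancy ))  # drops country C in 2000
  %>% group_by( year )                    # one group per distinct year
  %>% summarise(                          # collapses each group to one row
    min_life_expectancy = min( life_expectancy ),
    average_life_expectancy = mean( life_expectancy ),
    max_life_expectancy = max( life_expectancy ),
    number_of_rows = n()
  )
)

print( toy_summary )
```

The result should have two rows, one per year. The 2000 row is computed from two values (60 and 70) because the missing value was filtered out first, while the 2001 row uses all three, and number_of_rows records that difference.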

This combination of group_by() and summarise() is very powerful. Here are some challenge questions to see if you have got the hang of it:

Questions

Easy question: can you compute average (or median) life expectancy per country - averaging over years?

Harder question: you can also group by two things at once. Can you repeat the plot above but with the line having different colours for developed and developing countries?

Another question: You can also of course summarise multiple variables at once. Can you plot under-five mortality against life expectancy, taking an average over the years?

Hints.

  1. You can use group_by( year, status ) to group the data by both year and status.

  2. You'll also have to add the status variable to the ggplot() command to tell it to colour by status. It goes inside the aes() calls, like aes( ..., colour = status ). (aes is short for 'aesthetic mapping', which means a mapping from data variables to visual aspects of the plot.)
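
To see how the two hints fit together, here is a minimal sketch on a made-up toy table. The values are invented, and the status labels "Developed" and "Developing" are an assumption about how the real data codes this column, so check your own data before relying on them. The `.groups = "drop"` argument is an optional extra that returns an ungrouped result.

```r
library(dplyr)
library(ggplot2)

# Toy data, invented for illustration only. The status labels are assumed.
toy <- tibble(
  year = c( 2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001 ),
  status = c(
    "Developed", "Developed", "Developing", "Developing",
    "Developed", "Developed", "Developing", "Developing"
  ),
  life_expectancy = c( 78, 80, 60, 64, 79, 81, 61, 65 )
)

by_year_and_status <- (
  toy
  %>% group_by( year, status )  # one group per (year, status) combination
  %>% summarise(
    average_life_expectancy = mean( life_expectancy ),
    .groups = "drop"            # return a plain, ungrouped table
  )
)

p <- (
  ggplot( data = by_year_and_status )
  # colour = status inside aes() draws one coloured line per status
  + geom_line( aes( x = year, y = average_life_expectancy, colour = status ), linewidth = 2 )
  + theme_minimal()
  + ylab( "Average life expectancy" )
)

print( p )
```

The summary now has one row per combination of year and status (four rows here), and ggplot draws a separate line for each status because colour is mapped to it.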