
GROUP BY groups rows by column values

Summarising the whole data frame is fun, but it is a lot more useful to summarise within groups, for example to get one row of summary statistics per year.

This is (of course) a job for group_by():

(
  reformatted
  %>% filter( !is.na( life_expectancy ))
  %>% group_by( year ) # <- this is the only new operation!
  %>% summarise(
    min_life_expectancy = min( life_expectancy ),
    average_life_expectancy = mean( life_expectancy ),
    max_life_expectancy = max( life_expectancy ),
    number_of_rows = n()
  )
)

Cool! Let's plot it:

(
  reformatted
  %>% filter( !is.na( life_expectancy ))
  %>% group_by( year )
  %>% summarise(
    min_life_expectancy = min( life_expectancy ),
    average_life_expectancy = mean( life_expectancy ),
    max_life_expectancy = max( life_expectancy ),
    number_of_rows = n()
  )
  %>% ggplot()
  + geom_line( aes( x = year, y = average_life_expectancy ), linewidth = 2 )
  + geom_line( aes( x = year, y = min_life_expectancy ), linetype = 2, linewidth = 2 )
  + geom_line( aes( x = year, y = max_life_expectancy ), linetype = 2, linewidth = 2 )
  + theme_minimal()
  + ylab( "Life expectancy" )
)

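This combination of group_by() and summarise() is very powerful: group_by() splits the rows into groups, and summarise() then collapses each group into a single row of summary statistics.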
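To see exactly what the two steps do, it can help to try them on a tiny data frame where you can check the answer by hand. The sketch below uses a made-up table called toy (an assumption for illustration only, not the tutorial's reformatted data) with the same year and life_expectancy column names:

```r
library(dplyr)

# Toy data invented for illustration; not the real dataset.
toy <- tibble(
  year = c( 2000, 2000, 2001, 2001, 2001 ),
  life_expectancy = c( 70, 80, 72, NA, 82 )
)

toy_summary <- (
  toy
  %>% filter( !is.na( life_expectancy ))   # drop the missing value first
  %>% group_by( year )                     # one group per distinct year
  %>% summarise(
    average_life_expectancy = mean( life_expectancy ),
    number_of_rows = n()                   # rows in each group after filtering
  )
)

print( toy_summary )
```

The result should have one row per year: the five input rows become four after filtering, and those four collapse into two groups of two rows each.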

Questions

Easy question: can you compute average (or median) life expectancy per country - averaging over years?

Harder question: you can also group by two things at once. Can you repeat the plot above but with different colours for developed and developing countries?

Hints.

  1. You can use group_by( year, status ) to group the data by both year and status.

  2. You'll also have to add the status variable into the ggplot() command. It goes inside the aes() calls, which define the 'aesthetic mappings' (or 'channels') linking data variables to plot components such as position and colour.
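If you get stuck, here is a rough sketch of how the two hints fit together, again on made-up data rather than the real thing. The toy table and its status values ("Developed" / "Developing") are assumptions for illustration; check what your own data actually contains. The .groups = "drop" argument just tells summarise() to return an ungrouped result.

```r
library(dplyr)
library(ggplot2)

# Toy data invented for illustration; the status values are assumed.
toy <- tibble(
  year = c( 2000, 2000, 2000, 2001, 2001, 2001 ),
  status = c( "Developed", "Developing", "Developing",
              "Developed", "Developing", "Developing" ),
  life_expectancy = c( 80, 60, 64, 81, 61, 67 )
)

by_year_and_status <- (
  toy
  %>% group_by( year, status )   # one group per (year, status) combination
  %>% summarise(
    average_life_expectancy = mean( life_expectancy ),
    .groups = "drop"
  )
)

p <- (
  ggplot( by_year_and_status )
  # mapping status to colour draws one line per status value
  + geom_line( aes( x = year, y = average_life_expectancy, colour = status ), linewidth = 2 )
  + theme_minimal()
  + ylab( "Life expectancy" )
)
print( p )
```

Extending this to the dashed min and max lines from the plot above is left as part of the exercise.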