GROUP BY groups rows by column values
Getting the summary statistics across all countries and years at the same time may be a bit silly. What if we wanted to see these metrics grouped by country, or year?
This is, of course, a job for group_by(). For example, let's work out the minimum, average, and maximum life expectancy in each year of the data:
(
    reformatted
    %>% filter( !is.na( life_expectancy ))
    %>% group_by( year ) # <- this is the only new operation!
    %>% summarise(
        min_life_expectancy = min( life_expectancy ),
        average_life_expectancy = mean( life_expectancy ),
        max_life_expectancy = max( life_expectancy ),
        number_of_rows = n()
    )
)
Have a look at the result. You should see that there is now one row per year (but no separate countries).
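If it's not obvious why you get one row per year, it can help to run the same pipeline on a tiny table where you can check the numbers by hand. The sketch below uses a made-up data frame called toy (not the tutorial's data; the name and the values are invented for illustration) with two years and one missing value:

```r
library( dplyr )

# Toy data, invented for illustration only (not the tutorial's dataset)
toy = tibble(
    year = c( 2000, 2000, 2001, 2001, 2001 ),
    life_expectancy = c( 60, 80, 62, NA, 82 )
)

toy_summary = (
    toy
    %>% filter( !is.na( life_expectancy ))  # drops the one row with a missing value
    %>% group_by( year )                    # two groups: 2000 and 2001
    %>% summarise(
        min_life_expectancy = min( life_expectancy ),
        average_life_expectancy = mean( life_expectancy ),
        max_life_expectancy = max( life_expectancy ),
        number_of_rows = n()
    )
)

print( toy_summary )
```

Each group collapses to a single row, and number_of_rows counts only the rows that survived the filter, so the missing value in 2001 is not counted.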
Cool! Let's plot it:
(
    ggplot(
        data = (
            reformatted
            %>% filter( !is.na( life_expectancy ))
            %>% group_by( year )
            %>% summarise(
                min_life_expectancy = min( life_expectancy ),
                average_life_expectancy = mean( life_expectancy ),
                max_life_expectancy = max( life_expectancy ),
                number_of_rows = n()
            )
        )
    )
    + geom_ribbon(
        aes( x = year, y = average_life_expectancy, ymin = min_life_expectancy, ymax = max_life_expectancy ),
        fill = 'grey80',
        colour = 'black'
    )
    + geom_line( aes( x = year, y = average_life_expectancy ), linewidth = 2)
    + theme_minimal()
    + ylab( "Life expectancy\n(min/mean/max)" )
)
This combination of group by and summarise is very powerful. Here are some challenge questions to see if you have got the hang of it:
Easy question: can you compute average (or median) life expectancy per country - averaging over years?
Harder question: you can also group by two things at once. Can you repeat the plot above but with the line having different colours for developed and developing countries?
Another question: You can also of course summarise multiple variables at once. Can you plot under-five mortality against life expectancy, taking an average over the years?
Hints.
You can use group_by( year, status ) to group the data by both year and status. You'll also have to add the status variable into the ggplot() command to tell it to colour by this. They go inside those aes() commands, like aes( ..., colour = status ). (aes is short for 'aesthetic mapping', which means a mapping from data variables to visual aspects of the plot.)
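To see what grouping by two variables produces before you try it on the real data, here is a minimal sketch on a made-up table. The data frame toy and the status values "Developed" and "Developing" are assumptions for illustration; check what the status column actually contains in your data. The .groups = "drop" argument is optional: it just asks summarise() to return an ungrouped result instead of one still grouped by year.

```r
library( dplyr )

# Toy data, invented for illustration; the status labels are an assumption
toy = tibble(
    year = c( 2000, 2000, 2000, 2001, 2001, 2001 ),
    status = c( "Developed", "Developing", "Developing", "Developed", "Developing", "Developing" ),
    life_expectancy = c( 80, 60, 64, 81, 61, 67 )
)

by_year_and_status = (
    toy
    %>% group_by( year, status )   # one group per (year, status) combination
    %>% summarise(
        average_life_expectancy = mean( life_expectancy ),
        number_of_rows = n(),
        .groups = "drop"           # return a plain, ungrouped tibble
    )
)

print( by_year_and_status )
```

The result has one row for each combination of year and status (four rows here), which is the shape ggplot() needs to draw one line per status.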