GROUP BY groups rows by column values
Summarising the whole dataset is fun, but it'd be a lot more fun if we could compute the same summaries separately within different groups, for example one set of summaries per year.
This is (of course) a job for group_by():
(
  reformatted
  %>% filter( !is.na( life_expectancy ))
  %>% group_by( year ) # <- this is the only new operation!
  %>% summarise(
    min_life_expectancy = min( life_expectancy ),
    average_life_expectancy = mean( life_expectancy ),
    max_life_expectancy = max( life_expectancy ),
    number_of_rows = n()
  )
)
Cool! Let's plot it:
(
  reformatted
  %>% filter( !is.na( life_expectancy ))
  %>% group_by( year )
  %>% summarise(
    min_life_expectancy = min( life_expectancy ),
    average_life_expectancy = mean( life_expectancy ),
    max_life_expectancy = max( life_expectancy ),
    number_of_rows = n()
  )
  %>% ggplot()
  + geom_line( aes( x = year, y = average_life_expectancy ), linewidth = 2 )
  + geom_line( aes( x = year, y = min_life_expectancy ), linetype = 2, linewidth = 2 )
  + geom_line( aes( x = year, y = max_life_expectancy ), linetype = 2, linewidth = 2 )
  + theme_minimal()
  + ylab( "Life expectancy" )
)
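This combination of group_by() and summarise() is very powerful: group_by() splits the rows into groups, and summarise() then collapses each group into a single row of summary values.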
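If it isn't obvious what the grouping is doing, it can help to try it on a tiny table where you can check the answer by eye. The sketch below is only an illustration: the data frame toy and its values are made up and are not part of the real dataset.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration only)
toy <- tibble(
  year = c( 2000, 2000, 2001, 2001, 2001 ),
  life_expectancy = c( 60, 70, 62, NA, 74 )
)

toy_summary <- (
  toy
  %>% filter( !is.na( life_expectancy ))   # drop the row with a missing value
  %>% group_by( year )                     # one group per distinct year
  %>% summarise(
    average_life_expectancy = mean( life_expectancy ),
    number_of_rows = n()                   # rows remaining in each group
  )
)
print( toy_summary )
```

The result has one row per year: 2000 has an average of 65 from 2 rows, and 2001 has an average of 68 from 2 rows, because the row with the missing value was filtered out before grouping.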
Easy question: can you compute average (or median) life expectancy per country - averaging over years?
Harder question: you can also group by two things at once. Can you repeat the plot above but with different colours for developed and developing countries?
Hints.
You can use group_by( year, status ) to group the data by both year and status. You'll also have to add the status variable into the ggplot() command. Variables go inside those aes() calls, which are the 'aesthetic mappings' or 'channels' linking data variables to plot components.
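If you get stuck, here is a rough sketch of the general shape on made-up data. It is not the full answer: it plots only the average, and the toy table, its values, and the status labels "Developed" and "Developing" are assumptions for illustration. The real data may label things differently.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (values and status labels invented for illustration)
toy <- tibble(
  year = c( 2000, 2000, 2000, 2001, 2001, 2001 ),
  status = c( "Developed", "Developing", "Developing",
              "Developed", "Developing", "Developing" ),
  life_expectancy = c( 78, 60, 64, 79, 61, 67 )
)

by_year_and_status <- (
  toy
  %>% group_by( year, status )   # one group per (year, status) combination
  %>% summarise(
    average_life_expectancy = mean( life_expectancy ),
    .groups = "drop"             # return an ungrouped result
  )
)

p <- (
  by_year_and_status
  %>% ggplot()
  # mapping status to colour inside aes() draws one line per status
  + geom_line( aes( x = year, y = average_life_expectancy, colour = status ), linewidth = 2 )
  + theme_minimal()
  + ylab( "Life expectancy" )
)
print( p )
```

Because status is mapped inside aes(), ggplot draws a separate coloured line for each status value and adds a legend. The min and max lines from the earlier plot can be handled the same way.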