SUMMARISE or AGGREGATE combines across rows
Maybe we don't want data for every country but a summary of all countries put together? For example, what's the minimum, maximum, or average life expectancy across all countries and years?
This is a job for... summarise:
(
    reformatted
    %>% summarise(
        min_life_expectancy = min( life_expectancy ),
        average_life_expectancy = mean( life_expectancy ),
        max_life_expectancy = max( life_expectancy ),
        number_of_rows = n()
    )
)
Hey! It doesn't work!
Any idea why not?
Oh, we have some rows with life_expectancy = NA, and min(), mean() and max() all return NA when any of their input is missing. No problem, let's get rid of those rows:
(
    reformatted
    %>% filter( !is.na( life_expectancy ))
    %>% summarise(
        min_life_expectancy = min( life_expectancy ),
        average_life_expectancy = mean( life_expectancy ),
        max_life_expectancy = max( life_expectancy ),
        number_of_rows = n()
    )
)
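As an aside, filtering is not the only way to deal with the missing values. The sketch below uses the `na.rm = TRUE` argument of base R's `min()`, `mean()` and `max()` instead. It runs on a small made-up data frame: `toy` and its values are invented for illustration and are not taken from `reformatted`.

```r
library(dplyr)

# Made-up data for illustration only (not the real dataset)
toy <- tibble(
    life_expectancy = c( 50, NA, 70 )
)

toy_summary <- (
    toy
    %>% summarise(
        # na.rm = TRUE tells each function to drop NA values first
        min_life_expectancy = min( life_expectancy, na.rm = TRUE ),
        average_life_expectancy = mean( life_expectancy, na.rm = TRUE ),
        max_life_expectancy = max( life_expectancy, na.rm = TRUE ),
        # n() counts every row, including the one with the NA...
        number_of_rows = n(),
        # ...so count the non-missing values separately
        number_of_values = sum( !is.na( life_expectancy ))
    )
)

print( toy_summary )
```

Both approaches give the same minimum, mean and maximum. The difference is the row count: with `na.rm = TRUE` the rows with missing values are still there, so `n()` includes them.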
Hmm... that range of life expectancy is huge.
Use filter() to find the rows with those minimum and maximum life expectancies. Which countries / years are these?
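If you are not sure where to start, here is a hint in the form of a minimal sketch on made-up data. The data frame `toy`, its values, and the column names `country` and `year` are assumptions for illustration; check the actual column names in `reformatted` before adapting it.

```r
library(dplyr)

# Made-up data; `country` and `year` are assumed column names
toy <- tibble(
    country = c( "A", "B", "C" ),
    year = c( 2000, 2000, 2000 ),
    life_expectancy = c( 50, 60, 70 )
)

extremes <- (
    toy
    %>% filter(
        # keep rows whose value equals the column minimum or maximum
        life_expectancy == min( life_expectancy )
        | life_expectancy == max( life_expectancy )
    )
)

print( extremes )
```

Inside filter(), `min()` and `max()` are computed over the whole column, so each row is compared against the overall extremes. On the real data, remember to remove the NA rows first, otherwise the comparison values are themselves NA.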
In the above, we used n() to give a count of the rows in each group (here there is only one group: the whole data frame).
This is actually a special function defined in dplyr.
You can see other useful ones in the summarise() documentation.
(It's also possible to write your own.)
You don't need to know all these, but n() is particularly useful to remember.
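To make this concrete, here is a small sketch that uses n() next to n_distinct(), another dplyr helper, and a hand-written summary function. The data frame `toy`, the column name `country`, and the function `spread()` are all invented for illustration. Treating "write your own" as "write an ordinary R function that returns one value" is one reading of the remark above, not necessarily the only one.

```r
library(dplyr)

# A hand-written summary function (hypothetical name): any function that
# turns a vector into a single value can be used inside summarise()
spread <- function( x ) {
    max( x ) - min( x )
}

# Made-up data for illustration only
toy <- tibble(
    country = c( "A", "A", "B" ),
    life_expectancy = c( 50, 60, 70 )
)

toy_summary <- (
    toy
    %>% summarise(
        number_of_rows = n(),                        # count of rows
        number_of_countries = n_distinct( country ), # count of unique values
        life_expectancy_spread = spread( life_expectancy )
    )
)

print( toy_summary )
```

Note that n() takes no arguments and only works inside dplyr verbs such as summarise(), whereas `spread()` is a plain function you could also call on any numeric vector.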