FILTER filters rows
The third data verb is filter, which filters rows of the data frame based on column values.
Before filtering, make a table of what countries and years are in the data.
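One way to do this is sketched below on a small made-up stand-in for the data. The `toy` data frame and its values are invented for illustration; in the tutorial you would use `reformatted` instead, which is assumed to have `country` and `year` columns as in the code further down.

```r
# Toy stand-in for `reformatted`; these values are made up for illustration.
toy = data.frame(
  country = c( "Kenya", "Kenya", "Japan", "Japan", "Peru" ),
  year = c( 2000, 2001, 2000, 2001, 2000 ),
  life_expectancy = c( 51, 52, 81, 81.5, 70 )
)

# Count the rows for each country, and for each year
country_counts = table( toy$country )
year_counts = table( toy$year )
print( country_counts )
print( year_counts )

# Or cross-tabulate both columns at once
print( table( toy$country, toy$year ) )
```

On the real data, the same calls would be `table( reformatted$country )` and `table( reformatted$year )`.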
There are an awful lot of them! Let's get this down to a more manageable-sized data frame that we can easily plot. For example, let's just focus on a subset of countries, like these ones:
countries = c(
  "UK",
  "Ukraine",
  "Afghanistan",
  "Ghana",
  "Malawi",
  "Kenya",
  "Brazil",
  "USA",
  "Japan",
  "South Korea",
  "Australia",
  "China"
)
To keep just these countries, let's use filter:
(
  reformatted
  %>% filter(
    country %in% countries
  )
)
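As a minimal sketch of what filter is doing here, the example below applies the same kind of condition to a tiny made-up data frame. The names `toy` and `wanted` are hypothetical, and the values are invented. It assumes dplyr is installed. The `%in%` test gives TRUE or FALSE for each row, and filter keeps the rows where it is TRUE.

```r
library( dplyr )

# Made-up data, for illustration only
toy = tibble(
  country = c( "Kenya", "Kenya", "Japan", "Japan", "Peru" ),
  year = c( 2000, 2001, 2000, 2001, 2000 ),
  life_expectancy = c( 51, 52, 81, 81.5, 70 )
)
wanted = c( "Kenya", "Japan" )

# The condition is just a logical vector, with one value per row
print( toy$country %in% wanted )

# filter keeps the rows where the condition is TRUE
kept = (
  toy
  %>% filter( country %in% wanted )
)

# Several conditions separated by commas must all be TRUE
kept_2001 = (
  toy
  %>% filter( country %in% wanted, year == 2001 )
)
print( kept )
print( kept_2001 )
```

Note that `country == wanted` would not do the same job: `==` compares the two vectors element by element (recycling the shorter one), whereas `%in%` asks whether each country appears anywhere in the list.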
table() the countries and years again. Did the filtering work?
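Besides eyeballing the table, it can be worth checking whether every name in the list actually matched something, because a name that is spelled differently in the data silently matches no rows. The sketch below shows one way to check, using made-up data in which "United Kingdom" deliberately fails to match "UK". This is a hypothetical mismatch for illustration, not a claim about the real data, and `toy`, `wanted` and `not_found` are invented names.

```r
library( dplyr )

# Made-up data with a deliberately mismatched country name
toy = tibble(
  country = c( "Kenya", "Japan", "United Kingdom", "Peru" ),
  year = c( 2000, 2000, 2000, 2000 )
)
wanted = c( "Kenya", "Japan", "UK" )

filtered = (
  toy
  %>% filter( country %in% wanted )
)

# Every remaining row should be one of the countries we asked for
all_wanted = all( filtered$country %in% wanted )

# Names we asked for that never appear in the filtered data
not_found = setdiff( wanted, filtered$country )
print( all_wanted )
print( not_found )
```

If `not_found` is not empty, look at the table of country names in the unfiltered data to see how those countries are spelled there.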
Is filtering fun? Of course it is! One reason is that there are now few enough countries that we can reasonably put them all on one plot. First, here are some nice colours (these ones come from this page):
colours = c(
  '#1F83B4', '#12A2A8', '#2CA030', '#78A641',
  '#BCBD22', '#FFBF50', '#FFAA0E', '#FF7F0E',
  '#D63A3A', '#C7519C', '#BA43B4', '#8A60B0',
  '#6F63BB'
)
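The palette has 13 colours for 12 countries, which is fine, because scale_colour_manual() only needs at least as many values as there are groups. With an unnamed vector the colours are handed out in the order of the groups. If you would rather pin a particular colour to a particular country, one option is to name the vector, as in this sketch. The short vectors `some_countries` and `some_colours` are invented here so that the example runs on its own.

```r
# Shortened, made-up versions of the vectors above, for illustration
some_countries = c( "Kenya", "Japan", "Brazil" )
some_colours = c( '#1F83B4', '#12A2A8', '#2CA030', '#78A641' )

# Take one colour per country and label it with that country's name
named_colours = setNames(
  some_colours[ seq_along( some_countries ) ],
  some_countries
)
print( named_colours )
```

Passing a named vector like this as `values` to scale_colour_manual() matches colours to countries by name rather than by position.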
And let's plot the life expectancy in each year:
(
  reformatted
  %>% filter(
    country %in% countries
  )
  %>% ggplot(
    mapping = aes( x = year, y = life_expectancy, colour = country )
  )
  + geom_line()
  + scale_colour_manual( values = colours )
)
OK, that plot is not good enough, so let's make it better. We can do this through various tweaks, all of which are part of ggplot2. Here goes:
(
  reformatted
  %>% filter(
    country %in% countries
  )
  %>% ggplot(
    mapping = aes( x = year, y = life_expectancy, colour = country )
  )
  + geom_line( linewidth = 2 )   # <-- make lines thicker
  + geom_point( size = 4 )       # <-- draw points as well
  + scale_colour_manual( values = colours )
  + theme_minimal( 24 )          # <-- make the fonts bigger and get rid of the grey background
  + xlab( "Year" )               # <-- add axis labels
  + ylab( "Life expectancy" )
  # ...and one last bit, to rotate the y axis label:
  + theme( axis.title.y = element_text( angle = 0, vjust = 0.5 ))
)
It's more or less true that life expectancy increased over these 16 years in all the countries we looked at. (Although a bunch of data points seem to be outliers, for some reason.)
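A claim like this can also be checked with filter rather than by eye. The rough sketch below compares the first and last year for each country, using made-up numbers. The `toy` data frame, its values and the helper names are all invented for illustration, and it assumes each country has exactly one row per year.

```r
library( dplyr )

# Made-up data, for illustration only
toy = tibble(
  country = c( "Kenya", "Kenya", "Japan", "Japan" ),
  year = c( 2000, 2015, 2000, 2015 ),
  life_expectancy = c( 51, 62, 81, 84 )
)

# Rows for the earliest and the latest year in the data
first_year = ( toy %>% filter( year == min( year ) ) )
last_year = ( toy %>% filter( year == max( year ) ) )

# Line the two tables up by country before subtracting
change = (
  last_year$life_expectancy[ match( first_year$country, last_year$country ) ]
  - first_year$life_expectancy
)
names( change ) = first_year$country
print( change )

# TRUE if life expectancy went up in every country
print( all( change > 0 ) )
```

This only compares the two end points, so it says nothing about the outlying points in between.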