Analysing data with the data verbs
Welcome!
This tutorial is all about the 'data verbs': a set of high-level operations that you can apply to data frames. Like the small, simple tools on the UNIX command line, these operations are individually simple, but when put into pipelines
data %>% first operation %>% second operation %>% ...
they can produce really interesting results, really quickly.
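To make the pipeline idea concrete, here is a minimal sketch that chains four of the verbs together. The data frame `toy` and its column names are invented purely for illustration (they are not the tutorial's dataset), and the particular steps are just one arbitrary example of a pipeline.

```r
library(dplyr)

# Hypothetical toy data frame, made up for this illustration only
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2000, 2015, 2000, 2015),
  life_expectancy = c(60, 65, 70, 75)
)

# Each verb takes a data frame and returns a data frame,
# so the output of one step can be piped straight into the next.
result <- toy %>%
  filter(year == 2015) %>%                   # keep only the 2015 rows
  mutate(decades = life_expectancy / 10) %>% # add a new column
  arrange(desc(life_expectancy)) %>%         # sort, highest first
  select(country, decades)                   # keep just two columns

print(result)
```

Each step on its own is trivial; the interesting part is that the steps compose without any intermediate variables.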
You should learn these verbs! Luckily there are only eight to worry about:
Operation | Description | In R with dplyr | In Python with pandas | In SQL databases |
---|---|---|---|---|
Select | Select columns, and maybe rename | select() | df[[...]] or .filter(), with .rename() | SELECT |
Mutate | Add new columns | mutate() | .assign() | SELECT with an expression, or ADD COLUMN |
Filter | Filter rows based on column values | filter() | .query() or boolean indexing | WHERE |
Slice | Takes a 'slice' of rows, such as the top few or the last few | head(), slice_head(), slice_tail(), or slice_sample() | .head(), .tail(), or .sample() | (Varies by database) |
Arrange or sort | Sort rows based on column values | arrange() | .sort_values() | ORDER BY |
Summarise or aggregate | Combine values across rows | summarise() | .agg() | various functions, like SUM() |
Group by | Group rows based on column values | group_by() | .groupby() | GROUP BY |
Join | Join two data frames by column values | inner_join(), left_join(), right_join(), or full_join() | .merge() or .join() | INNER JOIN, or LEFT, RIGHT, or FULL JOIN |
(Plus a bonus verb - Pivot or reshape, which you can look at if you want.)
This tutorial works in R. But as the table indicates, the data verbs are general: they transcend individual programming languages and systems. The same operations can be used in other languages like Python, or in common relational databases like MySQL.
Did I mention you should learn these verbs?
Get the data
For this tutorial we'll use data on life expectancy and on economic factors in 193 countries globally, in the period 2000-2015. The underlying data is from the WHO Global Health Observatory and the United Nations, but the version we're using here was downloaded from kaggle.com.
To get started, open an R session and load the data:
data = readr::read_csv(
"https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/life_expectancy_data.csv"
)
And don't forget to load our favourite libraries:
library( dplyr )
library( ggplot2 )
Before we get started, have a look at the data frame briefly using your R skills to make sure you know what's in there.
How many columns are there? How many rows? What are the column names? What do the first few rows of data look like?
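If you need a reminder, the sketch below shows one way to answer those questions using base R functions. It runs on a small stand-in data frame called `toy`, whose name, columns, and values are invented here so that the example is self-contained; they are not taken from the real dataset.

```r
# Hypothetical stand-in for the real data frame (values are made up)
toy <- data.frame(
  country = c("A", "B", "C"),
  year = c(2000, 2000, 2000),
  life_expectancy = c(60.5, 70.2, 80.1)
)

n_columns <- ncol(toy)          # how many columns?
n_rows <- nrow(toy)             # how many rows?
column_names <- colnames(toy)   # what are the column names?
first_rows <- head(toy, n = 2)  # what do the first few rows look like?

print(n_columns)
print(n_rows)
print(column_names)
print(first_rows)
```

To try this on the tutorial's data, replace `toy` with `data`. With dplyr loaded, `glimpse(data)` gives a similar overview in a single call.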
Ready to start? Let's select some columns.