Analysing data with the data verbs

Welcome!

This tutorial is all about the 'data verbs': a set of high-level operations that you can apply to data frames. Like the small tools on the UNIX command line, each operation is simple on its own, but when chained into pipelines

data %>% first operation %>% second operation %>% ...

they can produce really interesting results, really quickly.
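
As a minimal sketch of what such a pipeline can look like in practice, the block below chains four dplyr verbs on a tiny made-up data frame. The `toy` data frame and its `group` and `value` columns are invented purely for illustration; they are not part of the tutorial data.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch only
toy <- data.frame(
  group = c("a", "a", "b", "b", "c"),
  value = c(1, 3, 10, 20, 5)
)

result <- toy %>%
  filter(value > 2) %>%                      # keep rows where value is above 2
  group_by(group) %>%                        # form one group per distinct 'group'
  summarise(mean_value = mean(value)) %>%    # collapse each group to a single row
  arrange(desc(mean_value))                  # sort with the largest mean first

print(result)
```

Each step takes a data frame in and hands a data frame on, which is what lets the steps be reordered, removed, or extended without rewriting the rest of the pipeline.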

You should learn these verbs! Luckily there are only seven to worry about:

| Operation | Description | In R with dplyr | In Python with pandas | In SQL databases |
|---|---|---|---|---|
| Select | Select columns, and maybe rename | select() | df[[...]] or .filter(), with .rename() | SELECT |
| Mutate | Add new columns | mutate() | .assign() | ADD COLUMN |
| Filter | Filter rows based on column values | filter() | .query() or boolean indexing | SELECT ... WHERE |
| Arrange or sort | Sort rows based on column values | arrange() | .sort_values() | ORDER BY |
| Summarise or aggregate | Combine values across rows | summarise() | .agg() | various functions, like SUM() |
| Group by | Group rows based on column values | group_by() | .groupby() | GROUP BY |
| Join | Join two data frames by column values | inner_join(), or left_join(), right_join(), or full_join() | .merge() or .join() | INNER JOIN, or LEFT, RIGHT, or FULL JOIN |

(Plus a bonus verb - Pivot or reshape, which you can look at if you want.)

Note

This tutorial uses R. But as the table indicates, the data verbs are general: they transcend individual programming languages and systems. The same operations can be used in other languages like Python, or in common relational databases like MySQL.

Did I mention you should learn these verbs?

Get the data

For this tutorial we'll use data on life expectancy and economic factors in 193 countries over the period 2000-2015. The underlying data is from the WHO Global Health Observatory and the United Nations, but the version we're using here was downloaded from kaggle.com.

To get started, open an R session and load the data:

data = readr::read_csv(
  "https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/life_expectancy_data.csv"
)

And don't forget to load our favourite libraries:

library( dplyr )
library( ggplot2 )

Before we get started, have a look at the data frame briefly using your R skills to make sure you know what's in there.

Question

How many columns are there? How many rows? What are the column names? What do the first few rows of data look like?
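
If your R skills need a quick refresher, here is a rough sketch of the kind of base R calls that answer these questions. It runs on a small stand-in data frame called `toy`, whose columns and values are made up for illustration; the same calls should work if you swap in the `data` object loaded above.

```r
# Hypothetical stand-in for the tutorial's `data` (values are invented)
toy <- data.frame(
  country = c("A", "B", "C"),
  year    = c(2000, 2001, 2002),
  value   = c(1.5, 2.5, 3.5)
)

dim(toy)        # number of rows, then number of columns
nrow(toy)       # number of rows only
ncol(toy)       # number of columns only
colnames(toy)   # the column names
head(toy, 2)    # the first two rows
```

With dplyr loaded, `glimpse()` is another option: it prints one line per column, which can be easier to read when a data frame has many columns.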

Ready to start? Let's select some columns.