
Analysing data with the data verbs

Welcome! This tutorial is all about the 'data verbs' - a set of high-level operations that you can apply to data frames. Like the small, simple tools on the UNIX command line, these operations are individually simple, but when put into pipelines like

data %>% first operation %>% second operation %>% ...

can produce really useful results, really quickly.
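To make the pipeline idea concrete, here is a minimal sketch using dplyr and R's built-in mtcars dataset. The dataset and the columns picked are stand-ins chosen for this example, not the data used later in the tutorial.

```r
library(dplyr)

# mtcars ships with R; it is used here purely as example data
result <- mtcars %>%
  filter(cyl == 4) %>%       # keep only the four-cylinder cars
  select(mpg, wt) %>%        # keep just two columns
  arrange(desc(mpg)) %>%     # sort by fuel economy, best first
  head(3)                    # take the top three rows

print(result)
```

Each step takes a data frame and hands a new data frame to the next step, which is why the verbs chain together so easily.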

Note

This tutorial has a cheat sheet! You should have been handed this, or right-click on the image below and save it to your computer.

[Image: data verbs cheat sheet]

Eight data verbs

You should learn these data verbs! Luckily there are only eight to worry about:

| Operation | Description | In R with dplyr | In Python with polars | In SQL databases |
| --- | --- | --- | --- | --- |
| Select | Select columns, and maybe rename | select() | .select() | SELECT |
| Mutate | Add new columns | mutate() | .with_columns() | ADD COLUMN |
| Filter | Filter rows based on column values | filter() | .filter() | SELECT ... WHERE |
| Slice | Takes a 'slice' of rows, such as the top few or the last few | head() or slice_head(), slice_tail(), or slice_sample() | ? | (Varies by database) |
| Arrange or sort | Sort rows based on column values | arrange() | .sort() | ORDER BY |
| Summarise or aggregate | Combine values across rows | summarise() | .agg() | various functions, like SUM() |
| Group by | Group rows based on column values | group_by() | .group_by() | GROUP BY |
| Join | Join two data frames by column values | inner_join() or left_, right_, or full_join() | .join() | INNER JOIN or LEFT, RIGHT, or FULL JOIN |

(Ok, there are also a few more once you get into this, like pivot or reshape. But the above will get you a long way.)
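To see how several of these verbs fit together, here is a small sketch in dplyr. The `sales` and `shops` data frames, and all their column names and values, are hypothetical and invented purely for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for this example
sales <- data.frame(
  shop   = c("A", "A", "B", "B", "C"),
  amount = c(10, 20, 5, 15, 40)
)
shops <- data.frame(
  shop = c("A", "B", "C"),
  city = c("Oxford", "Leeds", "York")
)

totals <- sales %>%
  group_by(shop) %>%                           # group by: one group per shop
  summarise(total = sum(amount), n = n()) %>%  # summarise: one row per group
  mutate(mean_amount = total / n) %>%          # mutate: add a derived column
  inner_join(shops, by = "shop") %>%           # join: attach each shop's city
  arrange(desc(total)) %>%                     # arrange: biggest total first
  slice_head(n = 2)                            # slice: keep the top two rows

print(totals)
```

Reading the pipeline from top to bottom tells you what happens to the data, one verb at a time.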

Data verbs in different languages

We're going to use R for this tutorial, and specifically the tidyverse / dplyr implementation of the data verbs.

However the ideas are general and, as the table above suggests, you can use these same 'verbs' in other settings as well. For example:

  • Most of the 'data verbs' are part of the SQL language used to talk to most relational databases (like MySQL or SQLite).
  • You can use them in Python in much the same way as in R (the polars library is a great way to do this).

Did I mention you should learn these verbs?

The examples in this tutorial all use the dplyr package (part of the tidyverse), which is particularly nice and easy to use, but feel free to translate them into other languages if you prefer.

Get started

Ready to start? Start an R session and let's go and get the data.