Analysing data with the data verbs
Welcome! This tutorial is all about the 'data verbs' - a set of high-level operations that you can apply to data frames. Like the small, simple tools in the UNIX command line, these operations are individually simple but when put into pipelines
data %>% first operation %>% second operation %>% ...
can produce really useful results, really quickly.
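As a rough sketch of what such a pipeline looks like in practice, the example below chains four verbs together using dplyr. It assumes the dplyr package is installed, and it uses R's built-in `mtcars` dataset purely as a stand-in (it is not the data used later in this tutorial).

```r
library(dplyr)

# mtcars ships with R; it is used here only for illustration
result <- mtcars %>%
  filter(cyl > 4) %>%                    # keep cars with more than 4 cylinders
  group_by(cyl) %>%                      # one group per cylinder count
  summarise(mean_mpg = mean(mpg)) %>%    # average miles per gallon in each group
  arrange(desc(mean_mpg))                # highest average first

print(result)
```

Each step takes a data frame in and hands a data frame on to the next step, which is what makes the verbs easy to combine.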
This tutorial has a cheat sheet! You should have been handed a copy; if not, right-click on the image below and save it to your computer.
Eight data verbs
You should learn these data verbs! Luckily there are only eight to worry about:
| Operation | Description | In R with dplyr | In Python with polars | In SQL databases |
|---|---|---|---|---|
| Select | Select columns, and maybe rename | select() | .select() | SELECT |
| Mutate | Add new columns | mutate() | .with_columns() | Expressions in SELECT (or ADD COLUMN to change the table itself) |
| Filter | Filter rows based on column values | filter() | .filter() | WHERE |
| Slice | Take a 'slice' of rows, such as the top few or the last few | head() or slice_head(), slice_tail(), or slice_sample() | .head(), .tail(), or .sample() | (Varies by database) |
| Arrange or sort | Sort rows based on column values | arrange() | .sort() | ORDER BY |
| Summarise or aggregate | Combine values across rows | summarise() | .agg() | Various functions, like SUM() |
| Group by | Group rows based on column values | group_by() | .group_by() | GROUP BY |
| Join | Join two data frames by column values | inner_join() or left_join(), right_join(), or full_join() | .join() | INNER JOIN or LEFT, RIGHT, or FULL JOIN |
(OK, there are also a few more once you get into this, like pivot or reshape. But the verbs above will get you a long way.)
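To make the table a little more concrete, here is a minimal sketch that applies each of the eight verbs once in dplyr. The `people` and `teams` data frames, and all the values in them, are invented purely for illustration; the sketch also assumes dplyr 1.0.0 or later, which is where `slice_head()` was introduced.

```r
library(dplyr)

# Two small made-up data frames (hypothetical data, for illustration only)
people <- data.frame(
  name   = c("Ana", "Ben", "Cy", "Dee"),
  team   = c("red", "blue", "red", "blue"),
  height = c(160, 175, 182, 168)            # heights in cm
)
teams <- data.frame(
  team = c("red", "blue"),
  city = c("Leeds", "York")
)

selected  <- people %>% select(person = name, height)     # Select (and rename)
with_m    <- people %>% mutate(height_m = height / 100)   # Mutate: add a column
tall      <- people %>% filter(height > 165)              # Filter rows
first_two <- people %>% slice_head(n = 2)                 # Slice: first two rows
by_height <- people %>% arrange(desc(height))             # Arrange: tallest first
overall   <- people %>% summarise(mean_height = mean(height))  # Summarise
per_team  <- people %>%
  group_by(team) %>%                                      # Group by team...
  summarise(mean_height = mean(height))                   # ...then summarise per group
with_city <- people %>% inner_join(teams, by = "team")    # Join on the team column

print(per_team)
print(with_city)
```

Note that `group_by()` does little on its own: it marks the groups, and the verb that follows (here `summarise()`) is then applied within each group.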
Data verbs in different languages
We're going to use R for this tutorial, and specifically the tidyverse / dplyr implementation of the data verbs.
However the ideas are general and, as the table above suggests, you can use these same 'verbs' in other settings as well. For example:
- Most of the 'data verbs' are part of the SQL language, which is used to talk to most relational databases (like MySQL or SQLite).
- You can use them in Python in much the same way as in R (the polars library is a great way to do this).
Did I mention you should learn these verbs?
This tutorial sticks with R, where the dplyr package (part of the tidyverse) provides a particularly nice and easy-to-use implementation, but feel free to translate the examples into other languages if you prefer.
Get started
Ready to start? Start an R session and let's go and get the data.