JOIN joins dataframes together

Wouldn't it be nice if our data frame told us which continent each country was in?

It doesn't, but here is a data frame that does (for a few countries):

continents = readr::read_tsv( "https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/continents.tsv")

How could we join the continent data onto our main one? Simple - we join it!

(
    reformatted
    %>% inner_join( continents, by = "country" )
)

Cool! Inspect the output now to make sure you know what it has done.

Anatomy of a join

As you can tell from the form above, you join two dataframes together 'by' a column, or set of columns. The join finds rows that have the same column values, and puts them together on the same row.

It's kindof like this pseudocode:

for row in the first dataframe:
find all rows in the other dataframe with the same 'by' column values
add the remaining columns of 'other' for these rows to this one

...except it's all done in one operation - you don't have to write out all that code.

Unfortunately there is a problem with our join!

Warning!

Something has gone wrong though. Can you see what it is?

(Tip: what countries are in the joined data? How many rows are there for each? Hint. you could use group by and sumarise to find this out.)

The different flavours of join

There are two problems in the data above, and they illustrate the behaviour of join.

Problem 1: repeated values

The first problem is that one of the countries actually was repeated in the continents data frame. (Did you notice this? Which one was it? You could use group by / summarise, or just a simple table() to find out.)

The join has done its job properly: it has joined each row of reformatted onto each matching row of continents. If there were two matching rows of continents, you'll of course get two rows in the output. (Our data frame has 16 rows for most countries, so that would mean two of each of these - 32 rows in the output - in total...)

To fix this we'd better remove the duplicated row. It happens to be row number 5:

continents = continents[-5,]

Now join them again to fix that problem.

Problem 2: missing values

The second problem is that... hey, we've lost most of our countries! (For example - where is the UK?).

This illustrates another issue with the join above. It has removed any rows that don't match.

This is actually a feature of the 'inner' type of join we used above and it can be solved by trying a different join operation. Specifically we could use a 'left join' which keeps all rows in the first dataframe even if nothing matches:

(
    reformatted
    %>% left_join( continents, by = "country" )
)

Now we get all our rows back! (Although most of them have missing continent values.)

Important moral

The moral of the story is:

Before joining, make sure you know what is in your data

In particular you should know what missing values there are, and what if any repeat values.

Different join forms

If you think about this for a moment you'll see there are four sensible behaviours a join could try here - and each of them has its own name / function. Here is a quick guide:

"Inner" join (inner_join() in dplyr) returns only rows that match between both datasets
"Left" join (left_join() in dplyr) return all rows of the left table, filling in missing data on the right as needed.
"Right" join (right_join() in dplyr) return all rows of the right table, filling in missing data on the left as needed.
"Full" or "outer" join (full_join() in dplyr) return all rows, filling in missing data in the eft or right table as needed.

Test it out

Try testing these four joins using these two data frames. (We'll make them using tribble, which is a handy way to make data frames by row).

A = tibble::tribble(
    ~ID, ~value_1,
    'a', 1,
    'b', 2,
    'c', 3,
    'd', 4
)

B = tibble::tribble(
    ~ID, ~value_2,
    'c', 3,
    'd', 4,
    'e', 5,
    'f', 6
)

Try joining A to B (by ID), using either an inner, left, right, or full join. Do you understand the results?

A join example

Let's use join() to compare two countries against each other - for example, the UK and Kenya:

UK = reformatted %>% filter( country == 'UK' ) %>% select( country, year, life_expectancy )
kenya = reformatted %>% filter( country == 'Kenya' ) %>% select( country, year, life_expectancy )

combined = (
    inner_join( UK, kenya, by = "year" )
)

(Actually we can improve that by giving the columns better names):

combined = (
    inner_join( UK, kenya, by = "year", suffix = c( "_UK", "_Kenya" ) )
)

Maybe you want that all in one go?

combined = (
    reformatted
    %>% filter( country == 'UK' ) %>% select( country, year, life_expectancy )
    %>% inner_join(
        reformatted
        %>% filter( country == 'Kenya' ) %>% select( country, year, life_expectancy )
    )
)

:::

JOIN joins dataframes together

The different flavours of join​

Problem 1: repeated values​

Problem 2: missing values​

Different join forms​

A join example​

The different flavours of join

Problem 1: repeated values

Problem 2: missing values

Different join forms

A join example