JOIN joins dataframes together
Wouldn't it be nice if our data frame told us which continent each country was in?
It doesn't, but here is a data frame that does (for a few countries):
continents = readr::read_tsv( "https://www.chg.ox.ac.uk/bioinformatics/training/gms/data/continents.tsv")
How could we join the continent data onto our main one? Simple - we join it!
(
reformatted
%>% inner_join( continents, by = "country" )
)
Cool! Inspect the output now to make sure you know what it has done.
As you can tell from the form above, you join two dataframes together 'by' a column, or set of columns. The join finds rows that have the same column values, and puts them together on the same row.
It's kindof like this pseudocode:
1. for row in the first dataframe:
2. find all rows in the other dataframe with the same 'by' column values
3. add the remaining columns of 'other' for these rows to this one
...except it's all done in one operation - you don't have to write out all that code.
Unfortunately there is a problem with our join!
Something has gone wrong though. Can you see what it is?
(Tip: what countries are in the joined data? How many rows are there for each? Hint. you could use group by and sumarise to find this out.)
The different flavours of join
There are two problems in the data above, and they illustrate the behaviour of join.
Problem 1: repeated values
The first problem is that one of the countries actually was repeated in the continents
data frame. (Did you
notice this? Which one was it? You could use group by / summarise, or just a simple table()
to find out.)
The join has done its job properly: it has joined each row of reformatted
onto each matching row of continents
.
If there were two matching rows of continents
, you'll of course get two rows in the output. (Our data frame has 16
rows for most countries, so that would mean two of each of these - 32 rows in the output - in total...)
To fix this we'd better remove the duplicated row. It happens to be row number 5:
continents = continents[-5,]
Now join them again to fix that problem.
Problem 2: missing values
The second problem is that... hey, we've lost most of our countries! (For example - where is the UK?).
This illustrates another issue with the join above. It has removed any rows that don't match.
This is actually a feature of the 'inner' type of join we used above and it can be solved by trying a different join operation. Specifically we could use a 'left join' which keeps all rows in the first dataframe even if nothing matches:
(
reformatted
%>% left_join( continents, by = "country" )
)
Now we get all our rows back! (Although most of them have missing continent values.)
The moral of the story is:
Before joining, make sure you know what is in your data
In particular you should know what missing values there are, and what if any repeat values.
Different join forms
If you think about this for a moment you'll see there are four sensible behaviours a join could try here - and each of them has its own name / function. Here is a quick guide:
- "Inner" join (
inner_join()
in dplyr) returns only rows that match between both datasets - "Left" join (
left_join()
in dplyr) return all rows of the left table, filling in missing data on the right as needed. - "Right" join (
right_join()
in dplyr) return all rows of the right table, filling in missing data on the left as needed. - "Full" or "outer" join (
full_join()
in dplyr) return all rows, filling in missing data in the eft or right table as needed.
Try testing these four joins using these two data frames. (We'll make them using tribble, which is a handy way to make data frames by row).
A = tibble::tribble(
~ID, ~value_1,
'a', 1,
'b', 2,
'c', 3,
'd', 4
)
B = tibble::tribble(
~ID, ~value_2,
'c', 3,
'd', 4,
'e', 5,
'f', 6
)
Try joining A to B (by ID
), using either an inner, left, right, or full join. Do you understand the results?
A join example
Let's use join()
to compare two countries against each other - for example, the UK and Kenya:
UK = reformatted %>% filter( country == 'UK' ) %>% select( country, year, life_expectancy )
kenya = reformatted %>% filter( country == 'Kenya' ) %>% select( country, year, life_expectancy )
combined = (
inner_join( UK, kenya, by = "year" )
)
(Actually we can improve that by giving the columns better names):
combined = (
inner_join( UK, kenya, by = "year", suffix = c( "_UK", "_Kenya" ) )
)
Maybe you want that all in one go?
combined = (
reformatted
%>% filter( country == 'UK' ) %>% select( country, year, life_expectancy )
%>% inner_join(
reformatted
%>% filter( country == 'Kenya' ) %>% select( country, year, life_expectancy )
)
)
:::