Controlling for covariates
Association tests can be susceptible to confounding. One of the great things about logistic regression is that it makes it easy to try to control for this - assuming we have measured the relevant variables.
Controlling for sampling unit: country
In this study, there is one variable that is definitely a potential confounder: the country. This is because samples were collected separately in each country - by different people and under different sampling schemes. Might this be skewing results?
To find this out, let's re-fit our model with country as a covariate:
fit2 = glm(
    status ~ o_bld_group + country,
    data = data,
    family = "binomial" # needed to specify logistic regression
)
summary(fit2)$coeff
How many samples from each country were in the data?
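One quick way to check (a small aside - base R's table() does the job, or dplyr's count() if you prefer, since dplyr is used elsewhere in this tutorial):

# Count the number of samples from each country
table( data$country )
# or equivalently, using dplyr:
data %>% count( country )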
Now look hard at the estimates that appear in the output. Which countries do you see in the row names?
This illustrates the way that regression handles categorical variables like country by default. One country is chosen as a 'baseline' (in an ad hoc manner) and the effects of the other countries are measured relative to it. Which country did glm pick as the baseline here? Which countries have higher O blood group frequencies and which have lower?
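If you want to check the baseline programmatically: when a character column is passed to glm, R converts it to a factor whose levels are in alphabetical order, so the baseline is simply the first level. (This is just a quick check - the row names in the summary tell you the same thing.)

# The baseline is the first factor level; for a character column this is
# the alphabetically first country
levels( factor( data$country ) )[1]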
Has including the country made a substantial difference to our estimate of the effect of O blood group?
If you want to control which country is the baseline, turn the column into a factor with your chosen ordering. For example, let's order the countries roughly west-to-east:
data$country = factor(
    data$country,
    levels = c(
        "Gambia",
        "Ghana",
        "Cameroon",
        "Tanzania",
        "Kenya"
    )
)
Now re-run the regression - can you see the difference?
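For reference, re-fitting uses exactly the same call as before - only the baseline changes (here I'm assuming you applied the west-to-east ordering above, so Gambia becomes the baseline). relevel() is an alternative if you only want to change the reference level without fixing the full ordering:

# Re-fit with the re-ordered country factor; the first level (Gambia) is now the baseline
fit2b = glm(
    status ~ o_bld_group + country,
    data = data,
    family = "binomial"
)
summary(fit2b)$coeff
# Alternative: change only the reference level, keeping the other levels as they are
# data$country = relevel( data$country, ref = "Gambia" )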
Controlling for genetic background
A major possible confounder for genetic association studies is ethnic group, or more precisely the genetic background of individuals.
For example, for all we know it could be that O blood group is more common in some ethnic groups. If those groups also happen to be less prone to severe malaria, or less likely to be enrolled in the study, this could distort the association. A simple way to assess this is to include ethnicity as a covariate in our test as well, keeping country in the model:
fit3 = glm(
    status ~ o_bld_group + country + ethnicity,
    data = data,
    family = "binomial" # needed to specify logistic regression
)
summary(fit3)$coeff
This produces a long output! Make sure you know what it's showing.
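If you just want the O blood group row from that long table, you can index it directly (a small convenience; I'm assuming the row is named 'o_bld_group', as it is in the country-level loop further below - the exact name depends on how the variable is coded):

# Extract just the O blood group estimate, standard error, z value and P-value
summary(fit3)$coefficients[ 'o_bld_group', ]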
However, in genome-wide association studies another way to assess this is often used: we estimate the genetic background using genome-wide genotyping data. This is often done using a method called principal components analysis, which analyses the genetic similarities between individuals to identify the major directions of genome-wide genetic variation.
Note. Genome-wide genetic variation typically varies with geography as well as ethnicity, and so controlling for it can pick up a range of possible confounding features, such as geographically varying environmental effects.
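For intuition only, here is a sketch of how such PCs might be computed. It assumes a genotype matrix G (samples in rows, variants in columns, coded 0/1/2), which is not part of this tutorial's dataset - in practice GWAS pipelines use dedicated software and careful variant filtering:

# Hypothetical sketch - G is an N x L genotype matrix, not included in this tutorial's data
G_std = scale( G )                  # centre and scale each variant (assumes no monomorphic variants)
pca = prcomp( G_std, rank. = 5 )    # compute the leading 5 principal components
pcs = pca$x                         # N x 5 matrix of sample coordinates (PC_1, ..., PC_5)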
Let's try controlling for principal components now as well. I've computed these already and they are in the data in the columns labelled PC_1, ..., PC_5. To see what these look like, let's plot the first two in each country:
print(
    ggplot(
        data = data,
        mapping = aes( x = PC_1, y = PC_2, colour = ethnicity )
    )
    + geom_point()
    + facet_wrap( ~country )
)
You should see that there is indeed quite a bit of genetic population structure in most countries in the data!
(Note. This is real data from African populations - this genetic structure is genuine and reflects the complex demographic history of the populations.)
These PCs were computed separately for each country, so let's go back to making estimates at a country level:
forest_plot_data2 = tibble()
populations = unique( data$country )
for( population in populations ) {
    fit = glm(
        status ~ o_bld_group + PC_1,
        family = "binomial",
        data = data %>% filter( country == population )
    )
    coeffs = summary(fit)$coefficients
    forest_plot_data2 = bind_rows(
        forest_plot_data2,
        tibble(
            population = population,
            beta_hat = coeffs['o_bld_group',1],
            se = coeffs['o_bld_group',2],
            p = coeffs['o_bld_group',4],
            PC1_beta = coeffs['PC_1',1],
            PC1_se = coeffs['PC_1',2],
            PC1_p = coeffs['PC_1',4]
        )
    )
}
You can plot them again:
forest_plot( forest_plot_data2 )
Does this plot differ much from your earlier plot without covariates? What happens if you include all 5 PCs? Can you 'destroy' the association by including covariates?
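For example, to include all five PCs you could change the model formula inside the loop to something like this (a sketch - the rest of the loop stays as it is):

# Inside the loop: adjust for all five principal components
fit = glm(
    status ~ o_bld_group + PC_1 + PC_2 + PC_3 + PC_4 + PC_5,
    family = "binomial",
    data = data %>% filter( country == population )
)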
What do you conclude about the association between O blood group and malaria susceptibility in each country?
Which of these statements do you agree with now?
- O blood group is associated with lower chance of having severe malaria, relative to A/B blood group.
- O blood group is associated with higher chance of having severe malaria, relative to A/B blood group.
- O blood group is protective against severe malaria.
- We can't tell whether O blood group is protective against severe malaria or confers risk.
A challenge question
Congratulations! This is the end of this tutorial.
If you want some more examples to try, please see the extras section.