Controlling for covariates

As made clear on the previous page, association tests might be susceptible to confounding. One of the great things about regression is that it's easy to control - or at least try to control - for this, assuming we have measured relevant variables.

Controlling for population structure

In this study, there is one variable that could definitely act as a potential confounder in each country: population structure, or in other words, variation in the genetic background of different study participants. It's a potential confounder because:

genetic background tends to vary geographically.
consequently ancestry could easily be correlated with the phenotype, either via environmental variation or differences in rates of recruitment to the study.
But different genetic backgrounds could definitely be associated with different frequencies of genotypes! (That's what the it means.)

Population structure is one of the classic potential confounders.

\text{genotype} \leftarrow \text{genetic background} \rightarrow \text{phenotype}

Luckily we have a couple of variables that tell us about this - a column containing the self-reported ethnicity of each participant, and also some principal components computed from genome-wide data. Let's use those to convince ourselves the association is not caused by this possible confounder.

Controlling for ethnic background

Let's focus on Kenya first, and re-fit our model across all the data, with and without ethnicity as a covariate.

kenya = data %>% filter( country == 'Kenya' )
original = glm(
    status ~ o_bld_group,
    data = kenya,
    family = "binomial"
)

adjusted = glm(
    status ~ o_bld_group + ethnicity,
    data = kenya,
    family = "binomial"
)

summary(original)$coeff
summary(adjusted)$coeff

Question

Has including the ethnicty made a substantial difference to our estimate of the effect of O blood group?

Question

Stop for a moment and use table() or similar to work out what ethnicities were in the data from Kenya.

Now look at the estimates that appear in the output. Which ethnicities do you see in the row names? (Note this is actually real data, but we have made up the ethnic group names to keep the originals private.)

This illustrates two key points about how glm() handles categorical variables.

First, if you add a categorical variable into the regression, it really gets treated like a series of individual $0-1$ variables. For example, here is has a variable ethnicity_18 which is a $1$ if the individual had that ethnicity and a $0$ if not. That's why each ethnicity gets its own output in the results.

A second point that you may have noticed, is that one ethnic group is missing from the output! (Which one?) The reason is that we already have a global baseline parameter: the 'intercept'. To avoid having two overall baseline levels, one of the ethnicities has been removed. An alternative way to solve this is to remove the overall intercept, which can be done by adding -1 to the formula like this:

adjusted2 = glm(
    status ~ o_bld_group + ethnicity - 1,
    data = kenya,
    family = "binomial"
)
summary(adjusted2)$coeff

This produces output something like this:

> summary(adjusted2)$coeff
                        Estimate Std. Error   z value     Pr(>|z|)
o_bld_group           -0.4029714 0.07834481 -5.143562 2.695779e-07
         ethnicity_17 -0.4647643 0.08083004 -5.749896 8.929836e-09
         ethnicity_18  0.4206776 0.06748674  6.233485 4.561713e-10
         ethnicity_19 -0.3090617 0.13314095 -2.321312 2.027002e-02
         ethnicity_20  0.6966946 0.14821850  4.700456 2.595808e-06

You should see there's no intercept any more, but we didn't need it because effectively there's one intercept per ethnicity.

Controlling for principal components

Another way to control for population structure is to use 'principal components' computed from genome-wide genetic data. This has the advantage of being directly based on the thing we'd like to control for - the genetic background of participants - and doesn't rely on self-reporting.

We won't go into how to compute principal components here, but in the data there are columns PC_1, PC_2 etc. which reflect the genetic structure of the samples in Kenya.

Note

You could start by plotting the first two PCs like this:

(
    ggplot( data = kenya )
    + geom_point( aes( x = PC_1, y = PC_2, col = ethnicity ))
    + theme_minimal()
)

You should see that there is real structure in this Kenyan population, and indeed it is correlated with self-reported ethnic group.

Let's run our test again controlling for these principal components instead:

adjusted3 = glm(
    status ~ o_bld_group + PC_1 + PC_2 + PC_3 + PC_4 + PC_5,
    data = kenya,
    family = "binomial"
)

summary(adjusted3)$coeff

Has this changed our estimates?

Note

What if you include both ethnicity and the PCs in the same regression?

Revisiting the forest plot

Earlier we made a forest plot without covariates. Let's expand our test to see if including PCs or ethnicity makes much difference across the other populations.

You should have the fit_regression() helper function from that page - if not go back and paste it in again. Check that it works in Kenya:

fit_regression( kenya, formula = status ~ o_bld_group + ethnicity + PC_1 + PC_2 + PC_3 + PC_4 + PC_5 )

You should get the same output as above, but in a slightly different data frame form that is more useful for forest plotting. Now let's get the unadjusted and adjusted estimates and put them in one output table:

estimates = (
    data
    %>% group_by( country )
    %>% reframe(
        fit_regression(
            pick( everything()),
            formula = status ~ o_bld_group
        )
    )
    %>% mutate( type = "unadjusted")
)

estimates = rbind(
    estimates,
    data
    %>% group_by( country )
    %>% reframe(
        fit_regression(
            pick( everything()),
            formula = status ~ o_bld_group + ethnicity + PC_1 + PC_2 + PC_3 + PC_4 + PC_5
        )
    )
    %>% mutate( type = "adjusted")
)

And plot it again:

p = (
    ggplot(
        data = estimates %>% filter( parameter == 'o_bld_group' )
    )
    + geom_segment(
        aes(
            x = exp( estimate - 1.96 * se ),
            xend = exp( estimate + 1.96 * se ),
            y = country,
            yend = country
        ),
        colour = 'grey',
        linewidth = 1.5
    )
    + geom_point(
        aes(
            x = exp( estimate ),
            y = country,
        ),
        size = 4
    )
    + facet_wrap( ~type )
    + xlab( "Odds ratio and 95% CI" )
    + ylab( "Country" )
    + geom_vline( xintercept = 1, linewidth = 1.5, linetype = 2 )
    + scale_x_log10( limits = c( 0.4, 1.2 ) )
    + theme_minimal(16)
    + theme( axis.title.y = element_text( angle = 0, vjust = 0.5 ))
)

print(p)

It should look somethig like this:

Beautiful!

Question

Do the unadjusted and adjusted effects differ?

Is there any evidence that the effects are due to confounding?

Question

Which of these statements do you agree with now?

O blood group is associated with lower chance of having severe malaria, relative to A/B blood group.
O blood group is associated with higher chance of having severe malaria, relative to A/B blood group.
O blood group is protective against severe malaria.
We can't tell if O blood group is protective or confers risk against severe malaria.

Further steps

Congratulations! This is the end of the main tutorial.

But mayube you want some more things to try? We have put a couple of extra challenges in the extra/ folder.

Try testing for association with an additive effect instead.
Try testing whether A/B blood group is also associated.

Controlling for covariates

Controlling for population structure​

Controlling for ethnic background​

Controlling for principal components​

Revisiting the forest plot​

Further steps​

Controlling for population structure

Controlling for ethnic background

Controlling for principal components

Revisiting the forest plot

Further steps