Simulating confounders

Let's simulate a confounding relationship now, and see what happens when we test for association. For simplicity we'll use continuous variables and linear regression.

Let's imagine we are interested in the effect of a variable $A$ on an outcome variable $C$. But there is a confounder $B$. What does that do to our association?

We'll simulate a classic confounding relationship like this:

(Figure: a causal diagram in which $B$ has arrows into both $A$ and $C$, and $A$ has an arrow into $C$; each edge is labelled with effect size $0.5$.)

So $B$ is a confounder: it affects both the predictor $A$ and the outcome $C$.

Simulating

We'll make this as simple as possible by making every variable normally distributed, with all effect sizes equal to $0.5$ as indicated in the diagram.

Let's start now, simulating data on $N = 1,000$ samples, say.

N = 1000
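
(Optional: if you want your numbers to match across runs, you can fix R's random seed before simulating. The particular seed value here is an arbitrary choice.)

set.seed( 42 )   # any fixed value makes the simulation reproducible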

First, since $B$ has no edges coming into it, we can simulate it simply by randomly drawing from a normal distribution:

B = rnorm( N, mean = 0, sd = 1 )

Second, since $A$ has only one arrow into it (from $B$), we can simulate it as $0.5 \times B$ plus some noise. To keep things simple, we'll add 'just enough' noise so it has variance $1$ again:

A = 0.5 * B + rnorm( N, mean = 0, sd = sqrt(0.75) )
Aside on variance computations

It turns out that adding noise with variance $0.75$ is just the right amount to make $A$ have variance $1$ again.

The reason for this is the property of the variance that, if $X$ and $Y$ are independent, then

$$\text{var}(X+Y) = \text{var}(X) + \text{var}(Y)$$

In our case, since $B$ has variance $1$, $0.5 \times B$ has variance $0.25$. Therefore, to bring the total back up to variance $1$, we need to add some independent 'noise' with variance $0.75$. That's what the code above does.
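
To convince yourself empirically, you can re-run the simulation with a large sample and check the sample variance. (This check is just an illustration; the sample size of 100,000 and the names big_B and big_A are arbitrary choices.)

big_B = rnorm( 100000, mean = 0, sd = 1 )
big_A = 0.5 * big_B + rnorm( 100000, mean = 0, sd = sqrt(0.75) )
var( big_A )   # should be close to 1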

If you want to see more detail on how these variance computations are carried out, see the 'extras' page on computing with (co)variances.

Finally, let's simulate $C$. It ought to be made up of a contribution from $A$, a contribution from $B$, and some additional noise:

$$C = 0.5 \times A + 0.5 \times B + \text{noise}$$

How much noise? Again, the contributions from $A$ and $B$ each have variance $0.25$. But they also covary, which has to be taken into account. The correct amount of noise turns out to be:

C = 0.5*A + 0.5*B + rnorm( N, mean = 0, sd = sqrt( 0.25 ))
Aside on the calculation

Why is this? Again it uses the basic formula for adding variances:

$$\text{var}(X+Y) = \text{var}(X) + \text{var}(Y) + 2\,\text{cov}(X,Y)$$

Here $\text{cov}(X,Y)$ is the covariance between $X$ and $Y$ - a measure of how much they vary together (after subtracting their means).

This and other formulae are derived on the computing with (co)variances page.

If you apply this to our example, adding up half of each of $A$ and $B$, it works out as:

$$\text{var}\left(\tfrac{1}{2} A + \tfrac{1}{2} B\right) = \tfrac{1}{4}\,\text{var}(A) + \tfrac{1}{4}\,\text{var}(B) + \tfrac{1}{2}\,\text{cov}(A,B)$$

Since $A$ and $B$ both have variance $1$, and since the covariance of $A$ and $B$ is $0.5$ because of how we computed $A$, this works out as $0.75$. So to make $C$ have variance $1$ again we need to add independent noise with variance $0.25$, which is what the line above does.
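
You can check these numbers directly from the simulated variables (the values will vary slightly from run to run):

cov( A, B )                # should be close to 0.5
var( 0.5 * A + 0.5 * B )   # should be close to 0.75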

Let's check this all worked now by computing the covariance between variables.

library( tibble )   # provides tibble(); also loaded as part of the tidyverse
simulated_data = tibble(
    A = A,
    B = B,
    C = C
)
cov( simulated_data )

You should see something like this:

> cov( simulated_data )
          A         B         C
A 1.0396708 0.5128317 0.7698858
B 0.5128317 1.0058647 0.7557131
C 0.7698858 0.7557131 1.0113318

Does this look right?
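
For comparison, the theoretical values can be worked out from the construction above: every variable has variance $1$, and

$$\text{cov}(A,B) = 0.5, \qquad \text{cov}(A,C) = 0.5\,\text{var}(A) + 0.5\,\text{cov}(A,B) = 0.75, \qquad \text{cov}(B,C) = 0.5\,\text{cov}(A,B) + 0.5\,\text{var}(B) = 0.75$$

so the sample covariance matrix above should be close to these values.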

Testing for association

Testing without controlling for the confounder

Now let's fit our linear regression model of the outcome variable $C$ on $A$:

fit1 = lm( C ~ A )
coeffs1 = summary(fit1)$coeff
print( coeffs1 )
Question

What estimate do you get for the effect of $A$ on $C$? Does it capture the causal effect? Is it too high? Too low?

How confident is the regression in this estimate (e.g. how small is the standard error, or how low is the P-value)?

Testing after controlling for the confounder

The great thing about regression models (as opposed to things like, say, t-tests or 2x2 tables) is that they make it easy to control for confounders. Let's fit another linear regression that simultaneously includes both $A$ and $B$ as predictors:

fit2 = lm( C ~ A + B )
coeffs2 = summary(fit2)$coeff
print( coeffs2 )
Question

What estimate do you get for the effect of $A$ on $C$? Does it capture the causal effect? Is it too high? Too low?

Is the estimate within a 95% confidence interval of the true causal effect (i.e. of $0.5$)?

Congratulations! By this point you should understand the basic concepts behind confounding and how they link to regression estimates. To check your understanding, try this challenge question:

Challenge question

Suppose we go back and re-simulate the data, but with a true (causal) effect size of zero between $A$ and $C$.

Draw the causal diagram and explain what you think will happen to the estimates now.

Now go back and re-simulate this scenario to confirm your expectations.

Try this for yourself! If unsure, you can look at the hints.
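
If you'd like to check your work afterwards, here is one minimal sketch that mirrors the simulation code above. With no $A \to C$ arrow, $C$ receives a $0.5$ contribution from $B$ only, so noise with variance $1 - 0.25 = 0.75$ keeps $\text{var}(C) = 1$:

B = rnorm( N, mean = 0, sd = 1 )
A = 0.5 * B + rnorm( N, mean = 0, sd = sqrt(0.75) )
C = 0.5 * B + rnorm( N, mean = 0, sd = sqrt(0.75) )   # no contribution from A
summary( lm( C ~ A ) )$coeff       # A and C still associate through B
summary( lm( C ~ A + B ) )$coeff   # controlling for B: estimate near zero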