
Computing with (co)variance

In the confounder simulation, we simulated a variable B with variance 1, and then made A another variable depending on B like this:

A = \tfrac{1}{2} B + \text{noise}

or in R:

A = 0.5 * B + rnorm( N, mean = 0, sd = something )

In our study we wanted A to have variance 1 as well, and the question was: what standard deviation should we put there to make this work?

It turns out that, if B has variance 1, and we want A to have variance 1, then adding noise with variance exactly 0.75 is the right thing here. Why is that?

The calculation is not very hard - it depends on the properties of the variance. This page explains it.
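Before doing the calculation, we can check the claim empirically. Here is a quick sketch (the sample size and seed are arbitrary choices); remember that `rnorm()` takes a standard deviation, so we pass the square root of the desired variance:

```r
# Empirical check: does noise with variance 0.75 give A variance 1?
set.seed( 1 )
N = 100000
B = rnorm( N, mean = 0, sd = 1 )                        # var(B) = 1
A = 0.5 * B + rnorm( N, mean = 0, sd = sqrt( 0.75 ) )   # sd = sqrt(variance)
var( A )                                                # close to 1
```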

Properties of the variance

Variance turns out to have two key properties which make this type of calculation easy. The properties are:

  1. Variance property 1: The variance of a multiple of any variable X scales like the multiple squared:
\begin{align} \text{var}( a \times X ) = a^2 \times \text{var}(X) \end{align}

and

  2. Variance property 2: The variance of a sum of two independent variables, \text{var}(X + Y) where X and Y are independent, is just the sum of their variances:
\begin{align} \text{var}(X+Y) = \text{var}(X) + \text{var}(Y) \end{align}

These rules are just what we need for the calculation above: the first lets us figure out how much variance the \tfrac{1}{2} B term contributes to A, and the second lets us work out how much more variance we need to add.
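Applying the two rules directly gives the answer. A short sketch of the arithmetic (variable names are my own):

```r
# Property 1: var( a * B ) = a^2 * var( B )
a = 0.5
var_B = 1
var_from_B = a^2 * var_B     # = 0.25, the variance contributed by 0.5 * B
# Property 2: the independent noise must supply the rest to reach 1
var_noise = 1 - var_from_B   # = 0.75
sqrt( var_noise )            # the sd to pass to rnorm()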

Challenge

Suppose we instead added only 0.25 of B to A:

A = 0.25 * B + rnorm( N, mean = 0, sd = something )

What 'something' do we need here?

Use the above properties to work this out on a piece of paper.
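Once you have an answer, you can check it by simulation. A small helper sketch (the function name is my own) that simulates A with your proposed sd and reports the resulting variance - close to 1 means you got it right:

```r
# Simulate A = 0.25 * B + noise with a candidate sd and report var( A ).
check_sd = function( candidate_sd, N = 100000 ) {
    B = rnorm( N, mean = 0, sd = 1 )
    A = 0.25 * B + rnorm( N, mean = 0, sd = candidate_sd )
    var( A )
}
set.seed( 1 )
check_sd( 1 )    # above 1, so sd = 1 is too big
```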

Properties of the covariance

That 'square-the-variable' behaviour always seems a bit complicated to me. I actually find these rules easiest to remember in this way:

  • The covariance \text{cov}(X,Y) between two variables is a bilinear function.

That is - it behaves like a linear (straight-line!) function of each of its two variables.

In other words, it is linear in the first term:

\text{cov}(aX,Y) = a\times\text{cov}(X,Y) \qquad \text{cov}(X+Y,Z) = \text{cov}(X,Z) + \text{cov}(Y,Z)

and it's also linear in the second term:

\text{cov}(X,aY) = a\times\text{cov}(X,Y) \qquad \text{cov}(X,Y+Z) = \text{cov}(X,Y) + \text{cov}(X,Z)

It's also symmetric:

\text{cov}(X,Y) = \text{cov}(Y,X)

Covariance is a measure of the co-linearity of two variables (around their means). It gets bigger the larger the variables are, and bigger the more they tend to take the same values (after subtracting their means). What's more, the variance of a variable X is just the covariance of X with itself:

\text{var}(X) = \text{cov}(X,X)
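R's built-in functions reflect this identity directly - a quick sketch:

```r
# var( x ) and cov( x, x ) compute the same quantity
set.seed( 1 )
x = rnorm( 1000 )
var( x )
cov( x, x )                          # identical to var( x )
all.equal( var( x ), cov( x, x ) )   # TRUE
```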

The two rules for variance given above boil down to applying the bilinearity property to the variance, as in:

\text{var}(aX) = \text{cov}(aX,aX) = a^2 \times \text{cov}(X,X) = a^2 \times \text{var}(X)

which is the first property, and

\begin{align} \text{var}(X+Y) = \text{cov}(X+Y,X+Y) = \text{cov}(X,X) + \text{cov}(Y,Y) + 2\times \text{cov}(X,Y) \end{align}

which is a more general form of the second property. (If X and Y are independent, their covariance is zero, so the last term vanishes and this reduces to the version above.)
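The general form can be checked in R with two deliberately *correlated* variables (a sketch; the particular construction of Y is my own choice). In fact the identity holds exactly for the sample statistics, not just approximately:

```r
# var( X + Y ) = var( X ) + var( Y ) + 2 * cov( X, Y ), even when correlated
set.seed( 1 )
N = 100000
X = rnorm( N )
Y = 0.5 * X + rnorm( N )    # Y is correlated with X by construction
lhs = var( X + Y )
rhs = var( X ) + var( Y ) + 2 * cov( X, Y )
c( lhs, rhs )               # the two agree
```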

Example

The last formula lets us work out more complex scenarios. For example, suppose again that

A = \tfrac{1}{2} B + \text{noise}

with variance 1, and suppose we then simulated a third variable C as

C = \tfrac{1}{2}A + \tfrac{1}{2} B + \text{noise}

...and we again wanted C to have variance 1. How much noise variance do we need? The calculation is easy using the expansion of \text{var}(X+Y) above:

\begin{align*} \text{var}(\tfrac{1}{2}A + \tfrac{1}{2} B) &= \text{cov}(\tfrac{1}{2}A + \tfrac{1}{2} B, \tfrac{1}{2}A + \tfrac{1}{2} B) \\ &= \text{cov}(\tfrac{1}{2}A, \tfrac{1}{2}A) + \text{cov}(\tfrac{1}{2}B, \tfrac{1}{2}B) + 2\times \text{cov}(\tfrac{1}{2}A, \tfrac{1}{2}B) \\ &= \tfrac{1}{4}\text{cov}(A,A) + \tfrac{1}{4}\text{cov}(B,B) + \tfrac{2}{4}\text{cov}(A,B) \\ &= \tfrac{1}{4}\text{var}(A) + \tfrac{1}{4}\text{var}(B) + \tfrac{1}{2}\text{cov}(A,B) \end{align*}

In our computation A and B had variance 1, while we had

\text{cov}(A,B) = \tfrac{1}{2}

because of how A was simulated: \text{cov}(A,B) = \text{cov}(\tfrac{1}{2}B + \text{noise}, B) = \tfrac{1}{2}\text{var}(B) = \tfrac{1}{2}, since the noise is independent of B. So this boils down to

\text{var}(\tfrac{1}{2}A + \tfrac{1}{2} B) = \tfrac{1}{4} + \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{3}{4}

In other words, we need to add noise with variance \tfrac{1}{4} to make C have variance 1.
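Putting it all together, we can simulate the whole chain and check that all three variables end up with variance close to 1 (a sketch; sample size and seed are arbitrary):

```r
# Simulate B, then A, then C, with the noise variances derived above
set.seed( 1 )
N = 100000
B = rnorm( N, mean = 0, sd = 1 )
A = 0.5 * B + rnorm( N, mean = 0, sd = sqrt( 0.75 ) )
C = 0.5 * A + 0.5 * B + rnorm( N, mean = 0, sd = sqrt( 0.25 ) )
c( var( B ), var( A ), var( C ) )   # all close to 1
```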