A/B example worked solution

Below is my worked solution to the A/B blood group question - don't peek unless you are stuck!

A basic approach

A basic approach is to test rs8176746 directly, using its genotype dosage as the predictor:

fit5 = glm(
    status ~ rs8176746_dosage + country,
    data = data,
    family = "binomial"  # needed to specify logistic regression
)
summary(fit5)$coeff

                    Estimate Std. Error    z value     Pr(>|z|)
(Intercept)      -0.07304875 0.03118217 -2.3426452 1.914758e-02
rs8176746_dosage  0.17814569 0.03573289  4.9854826 6.180734e-07
(etc.)

Hey - rs8176746 looks associated as well! Is B blood type associated with higher risk?

However, what happens if we put O blood group and rs8176746 into the model at the same time?

fit5b = glm(
    status ~ o_bld_group + rs8176746_dosage + country,
    data = data,
    family = "binomial"  # needed to specify logistic regression
)
summary(fit5b)$coeff

                    Estimate Std. Error    z value     Pr(>|z|)
(Intercept)       0.10396026 0.04122405  2.5218350 1.167444e-02
o_bld_group      -0.30265008 0.04505738 -6.7169930 1.855129e-11
rs8176746_dosage  0.04228134 0.04143095  1.0205255 3.074793e-01
(etc.)

Oh. Now rs8176746 doesn't look associated.

So what's going on?

One way to understand this is to ask what the baseline group in each regression is - that is, the individuals for whom every genetic predictor is zero, so that only the intercept and country terms contribute to the linear predictor:

For fit5, the baseline is everyone who has rs8176746 == G/G.
For fit5b, however, the baseline is everyone who has rs8176746 == G/G and has non-O blood type.

Even though the two models look similar, they are measuring different things. In the first fit (fit5), the baseline group includes a bunch of people who have O blood type, but in the second fit it doesn't.

(If we believe O blood group is protective, this is another way of saying that O blood group is confounding the rs8176746 association.)
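The same pattern can be reproduced on simulated data, which may help make the confounding concrete. This is a minimal sketch, not the tutorial dataset: the variables `o_group`, `marker_dosage` and `sim`, the sample size, and all effect sizes are hypothetical choices made for illustration. The marker is given no effect of its own; it is only made rarer in O individuals.

```r
set.seed(1)
n = 20000

# Hypothetical simulated predictors (not the tutorial data):
# o_group is 1 for O blood type; the marker allele is much rarer in O individuals.
o_group = rbinom(n, size = 1, prob = 0.45)
marker_dosage = rbinom(n, size = 2, prob = ifelse(o_group == 1, 0.03, 0.30))

# Only o_group affects the outcome; the marker has no effect of its own.
status = rbinom(n, size = 1, prob = plogis(0.1 - 0.3 * o_group))
sim = data.frame(status, o_group, marker_dosage)

# Marker on its own: it picks up the effect of the correlated o_group variable.
fit_marker_only = glm(status ~ marker_dosage, data = sim, family = "binomial")

# Both predictors together: the marker estimate moves back towards zero.
fit_both = glm(status ~ o_group + marker_dosage, data = sim, family = "binomial")

summary(fit_marker_only)$coeff
summary(fit_both)$coeff
```

Comparing the `marker_dosage` rows of the two tables shows how much of the apparent marker effect comes from its correlation with `o_group`.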

Encoding A/B/O directly

A better way to solve this problem is to encode the biologically relevant variable directly. The biology works as follows: each individual has two chromosomes, and each carries the determinant of either the A or the B antigen (the 'G' or 'T' allele at rs8176746, respectively). Each chromosome might also carry the loss-of-function 'O' deletion (the '-' allele at rs8176719). Based on this we can call A/B/O blood type as follows:

rs8176719  rs8176746   blood group phenotype
---------  ---------   ---------------------
   C/C        G/G               A
   C/C        G/T               AB
   C/C        T/T               B
   -/C        G/G               A
   -/C        G/T            B or A? (*)
   -/C        T/T               B
   -/-        G/G               O
   -/-        G/T               O
   -/-        T/T               O

The cell marked (*) is the only difficult one here - both variants are heterozygous, and we can't tell from this data how their alleles are arranged together on the two chromosomes (that is, their phase). However, here we are helped by the fact that the O blood type mutation almost always occurs on the 'A' type background (i.e. chromosomes with the 'G' allele at rs8176746). You can see this by tabling the two variants:

table( data$rs8176719, data$rs8176746 )

      G/G  G/T  T/T
 -/- 4621  264    7
 -/C 2331 2244   60
 C/C  331  510  287
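The counts in this table can also be turned into allele frequencies, which gives a rough check on the claim about the 'A' background. The sketch below re-enters the counts by hand, so it does not depend on `data`; the names `counts` and `t_allele_freq` are introduced here for illustration only.

```r
# Genotype counts copied from the table above (rows: rs8176719, columns: rs8176746)
counts = matrix(
    c(4621,  264,   7,
      2331, 2244,  60,
       331,  510, 287),
    nrow = 3, byrow = TRUE,
    dimnames = list(
        rs8176719 = c("-/-", "-/C", "C/C"),
        rs8176746 = c("G/G", "G/T", "T/T")
    )
)

# Proportion of each rs8176746 genotype within each rs8176719 genotype
round(prop.table(counts, margin = 1), 3)

# Frequency of the 'T' allele within each rs8176719 genotype:
# heterozygotes carry one copy, T/T homozygotes carry two
t_allele_freq = (counts[, "G/T"] + 2 * counts[, "T/T"]) / (2 * rowSums(counts))
round(t_allele_freq, 3)

# If deletion and non-deletion chromosomes pair up roughly at random, the
# frequency in -/C individuals should sit near the average of the other two rows
mean(t_allele_freq[c("-/-", "C/C")])
```

The 'T' allele is rare among -/- individuals and common among C/C individuals, which fits with most deletion-carrying chromosomes having the 'G' allele.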

The large majority of type O individuals (4621 of 4892) have the G/G genotype at rs8176746, and the counts for heterozygous -/C individuals are also consistent with most O type haplotypes carrying the 'G' allele. For the sake of this tutorial we will therefore assume that the doubly-heterozygous individuals (-/C and G/T) have B blood type. Let's encode this now:

data$abo_type = factor( NA, levels = c( "A", "B", "AB", "O" ))
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'G/G' ] = 'A'
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'G/T' ] = 'AB'
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'T/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'G/G' ] = 'A'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'G/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'T/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/-' ] = 'O'
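It can be worth checking that every genotype combination received the intended type and that missing genotypes stay missing. As a sketch, the same encoding can be written as a lookup table and tried on a small toy data frame. The `toy` data frame and the `lookup` and `key` names are hypothetical, and unlike the code above, this version leaves the type as NA whenever either genotype is missing.

```r
# Toy stand-in for `data` (hypothetical values, for illustration only)
toy = data.frame(
    rs8176719 = c("C/C", "-/C", "-/C", "-/-", "C/C", NA),
    rs8176746 = c("G/T", "G/T", "G/G", "T/T", "T/T", "G/G")
)

# Lookup table mirroring the encoding above: "<rs8176719> <rs8176746>" -> type
lookup = c(
    "C/C G/G" = "A", "C/C G/T" = "AB", "C/C T/T" = "B",
    "-/C G/G" = "A", "-/C G/T" = "B",  "-/C T/T" = "B",
    "-/- G/G" = "O", "-/- G/T" = "O",  "-/- T/T" = "O"
)

# Combinations that are not in the lookup (including missing genotypes) become NA
key = paste(toy$rs8176719, toy$rs8176746)
toy$abo_type = factor(unname(lookup[key]), levels = c("A", "B", "AB", "O"))

# Count each type, showing any samples left unassigned
table(toy$abo_type, useNA = "ifany")
```

On the real data, `table( data$abo_type, useNA = "ifany" )` gives the corresponding check.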

...and fit it:

fit6 = glm(
    status ~ abo_type + country,
    data = data,
    family = "binomial"  # needed to specify logistic regression
)
summary(fit6)$coeff

The baseline is now, of course, 'A' blood type individuals.

Question

Is there any evidence that B or AB blood type is associated with a different risk of malaria, compared to A?
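One way to read the fit when answering this is to convert each coefficient into an odds ratio relative to the baseline, with an approximate 95% interval. The helper below is a sketch: `odds_ratio_table` is a hypothetical name, the interval is a simple Wald interval (estimate plus or minus 1.96 standard errors), and the demonstration uses simulated toy data rather than the tutorial dataset.

```r
# Hypothetical helper: odds ratios with Wald 95% intervals from a fitted glm
odds_ratio_table = function(fit) {
    coeffs = summary(fit)$coeff
    estimate = coeffs[, "Estimate"]
    se = coeffs[, "Std. Error"]
    data.frame(
        odds_ratio = exp(estimate),
        lower_95 = exp(estimate - 1.96 * se),
        upper_95 = exp(estimate + 1.96 * se),
        p_value = coeffs[, "Pr(>|z|)"],
        row.names = rownames(coeffs)
    )
}

# Demonstration on simulated toy data (not the tutorial data):
# only the O group is given a different risk from the A baseline
set.seed(2)
toy = data.frame(
    abo_type = factor(
        sample(c("A", "B", "AB", "O"), 2000, replace = TRUE),
        levels = c("A", "B", "AB", "O")
    )
)
toy$status = rbinom(nrow(toy), size = 1, prob = plogis(ifelse(toy$abo_type == "O", -0.3, 0)))
toy_fit = glm(status ~ abo_type, data = toy, family = "binomial")
odds_ratio_table(toy_fit)
```

Applied to the tutorial fit, `odds_ratio_table(fit6)[ c( "abo_typeB", "abo_typeAB" ), ]` picks out the two rows the question asks about; R names these rows by pasting the factor name and the level together.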
