
A/B example worked solution

Below is my worked solution to the A/B blood group question - don't peek unless you are stuck!

A basic approach

A basic approach is to test rs8176746 directly, by including its dosage as a predictor alongside country.

fit5 = glm(
  status ~ rs8176746_dosage + country,
  data = data,
  family = "binomial" # needed to specify logistic regression
)
summary(fit5)$coeff
                    Estimate Std. Error    z value     Pr(>|z|)
(Intercept)      -0.07304875 0.03118217 -2.3426452 1.914758e-02
rs8176746_dosage  0.17814569 0.03573289  4.9854826 6.180734e-07
(etc.)

Hey - rs8176746 looks associated as well! Is B blood type associated with higher risk?

However, what happens if we put them both in at once?

fit5b = glm(
  status ~ o_bld_group + rs8176746_dosage + country,
  data = data,
  family = "binomial" # needed to specify logistic regression
)
summary(fit5b)$coeff
                    Estimate Std. Error    z value     Pr(>|z|)
(Intercept)       0.10396026 0.04122405  2.5218350 1.167444e-02
o_bld_group      -0.30265008 0.04505738 -6.7169930 1.855129e-11
rs8176746_dosage  0.04228134 0.04143095  1.0205255 3.074793e-01
(etc.)

Oh. Now rs8176746 doesn't look associated.

So what's going on?

One way to think about this is to ask what the baseline level in each regression is - that is, the combination of predictor values at which the linear predictor reduces to the intercept alone:

  • For fit5, the baseline is everyone who has rs8176746 == G/G.
  • For fit5b, however, the baseline is everyone who has rs8176746 == G/G and has non-O blood type.

Even though the two models look similar, they are measuring different things. In the first fit (fit5), the baseline group includes a bunch of people who have O blood type, but in the second fit it doesn't.

(If we believe O blood group is protective, this is another way of saying that O blood group is confounding the rs8176746 association.)
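To make the confounding argument concrete, here is a minimal simulated sketch. It does not use the tutorial data: the sample size, allele frequencies and effect size are all made-up assumptions, and `sim`, `b_dosage`, `fit_marginal` and `fit_joint` are hypothetical names. The outcome is simulated to depend only on O status, yet the B-like allele still picks up an apparent effect when O status is left out of the model.

```r
# Simulated illustration only: every number below is an assumption,
# not an estimate from the tutorial dataset.
set.seed(1)
n = 50000

# O blood group indicator (1 = type O), with an assumed frequency of 0.5
o_bld_group = rbinom(n, size = 1, prob = 0.5)

# Dosage (0, 1 or 2) of a 'B'-like allele, made rarer among O individuals
b_dosage = rbinom(n, size = 2, prob = ifelse(o_bld_group == 1, 0.05, 0.3))

# Status depends ONLY on O status (assumed log odds ratio -0.5),
# not on the B-like allele
status = rbinom(n, size = 1, prob = plogis(0.1 - 0.5 * o_bld_group))

sim = data.frame(status, o_bld_group, b_dosage)

# Model without O status: the baseline group mixes O and non-O individuals
fit_marginal = glm(status ~ b_dosage, data = sim, family = "binomial")

# Model with O status: the baseline group is non-O individuals only
fit_joint = glm(status ~ o_bld_group + b_dosage, data = sim, family = "binomial")

marginal_est = coef(fit_marginal)["b_dosage"]
joint_est = coef(fit_joint)["b_dosage"]
print(c(marginal = unname(marginal_est), joint = unname(joint_est)))
```

In this sketch the marginal estimate for `b_dosage` should come out clearly positive, while the joint estimate should sit much closer to zero, which is the same qualitative pattern as fit5 versus fit5b.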

Encoding A/B/O directly

A better way to solve this problem is to encode the biologically relevant variable directly. The biology works as follows: each individual has two chromosomes, and each carries the determinant of either the A or B antigen (the 'G' or 'T' allele at rs8176746, respectively). Each chromosome might also carry the loss-of-function 'O' deletion (the '-' allele at rs8176719). Based on this we can call A/B/O blood type as follows:

rs8176719  rs8176746  blood group phenotype
---------  ---------  ---------------------
C/C        G/G        A
C/C        G/T        AB
C/C        T/T        B
-/C        G/G        A
-/C        G/T        B or A? (*)
-/C        T/T        B
-/-        G/G        O
-/-        G/T        O
-/-        T/T        O

The cell marked (*) is the only difficult one here: both variants are heterozygous, and we can't tell from this data how the alleles are arranged together on the two chromosomes (their phase). However, we are helped by the fact that the O deletion almost always occurs on the 'A' type background (i.e. on chromosomes carrying the 'G' allele at rs8176746). You can see this by tabling the two variants:

table( data$rs8176719, data$rs8176746 )
       G/G  G/T  T/T
  -/- 4621  264    7
  -/C 2331 2244   60
  C/C  331  510  287
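To put numbers on this table, the counts can be turned into row proportions. The sketch below re-enters the counts shown above by hand (so it runs without the dataset); `counts` and `row_props` are names introduced here for illustration.

```r
# Counts copied from the table() output above:
# rows are rs8176719 genotypes, columns are rs8176746 genotypes
counts = matrix(
  c(4621,  264,   7,
    2331, 2244,  60,
     331,  510, 287),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    rs8176719 = c("-/-", "-/C", "C/C"),
    rs8176746 = c("G/G", "G/T", "T/T")
  )
)

# Proportion of each rs8176746 genotype within each rs8176719 genotype
row_props = prop.table(counts, margin = 1)
print(round(row_props, 3))
```

With the real data loaded, `prop.table( table( data$rs8176719, data$rs8176746 ), margin = 1 )` should give the same thing directly.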

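With relatively few exceptions (271 of 4,892), type O individuals have G/G genotype at rs8176746; the heterozygous -/C individuals are also consistent with most O type haplotypes carrying the 'G' allele. If so, then in a doubly-heterozygous individual the deletion sits on the 'G' chromosome, leaving the 'T' (B-determining) allele on the functional chromosome. For the sake of this tutorial we will therefore assume that these doubly-heterozygous individuals have B blood type. Let's encode this now: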

data$abo_type = factor( NA, levels = c( "A", "B", "AB", "O" ))
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'G/G' ] = 'A'
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'G/T' ] = 'AB'
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'T/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'G/G' ] = 'A'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'G/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'T/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/-' ] = 'O'

...and fit it:

fit6 = glm(
  status ~ abo_type + country,
  data = data,
  family = "binomial" # needed to specify logistic regression
)
summary(fit6)$coeff

The baseline is now, of course, 'A' blood type individuals.
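The baseline is 'A' because R treats the first level of a factor as the reference level in a regression, and 'A' was listed first when `abo_type` was created. As a small sketch on a toy factor (not the tutorial data; `abo_type_o_ref` is a hypothetical name), here is how a different reference group, such as 'O', could be chosen with `relevel()`:

```r
# Toy factor using the same level order as the tutorial's abo_type
abo_type = factor(
  c("A", "O", "B", "AB", "O"),
  levels = c("A", "B", "AB", "O")
)

# The first level is the one glm() would treat as the baseline
print(levels(abo_type)[1])

# relevel() moves the chosen level to the front; the values are unchanged
abo_type_o_ref = relevel(abo_type, ref = "O")
print(levels(abo_type_o_ref))
```

Refitting with the releveled factor would express the other blood groups relative to 'O' instead of 'A'; the fitted model is the same, only the parameterisation changes.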

Question

Is there any evidence that B or AB blood type is associated with a different risk of malaria, compared to A?