A/B example worked solution
Below is my worked solution to the A/B blood group question - don't peek unless you are stuck!
A basic approach
A basic approach is to test rs8176746 for association directly, using its dosage as the predictor and adjusting for country:
fit5 = glm(
  status ~ rs8176746_dosage + country,
  data = data,
  family = "binomial" # needed to specify logistic regression
)
summary(fit5)$coeff
                    Estimate Std. Error    z value     Pr(>|z|)
(Intercept)      -0.07304875 0.03118217 -2.3426452 1.914758e-02
rs8176746_dosage  0.17814569 0.03573289  4.9854826 6.180734e-07
(etc.)
Hey - rs8176746 looks associated as well! Is B blood type associated with higher risk?
However, what happens if we put them both in at once?
fit5b = glm(
  status ~ o_bld_group + rs8176746_dosage + country,
  data = data,
  family = "binomial" # needed to specify logistic regression
)
summary(fit5b)$coeff
                    Estimate Std. Error    z value     Pr(>|z|)
(Intercept)       0.10396026 0.04122405  2.5218350 1.167444e-02
o_bld_group      -0.30265008 0.04505738 -6.7169930 1.855129e-11
rs8176746_dosage  0.04228134 0.04143095  1.0205255 3.074793e-01
(etc.)
Oh. Now rs8176746 doesn't look associated.
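To read these estimates on the odds ratio scale, the log-odds coefficients can be exponentiated. The sketch below simply re-uses the estimates and standard errors printed above (it does not refit the model) and assumes a Wald-type 95% interval; the variable names est, se and odds_ratios are ours, not part of the tutorial code.

```r
# Estimates and standard errors copied from the fit5b output above
est = c( o_bld_group = -0.30265008, rs8176746_dosage = 0.04228134 )
se  = c( o_bld_group =  0.04505738, rs8176746_dosage = 0.04143095 )

# Wald 95% confidence interval on the log-odds scale, then exponentiate
z = qnorm( 0.975 )
odds_ratios = data.frame(
  OR    = exp( est ),
  lower = exp( est - z * se ),
  upper = exp( est + z * se )
)
print( round( odds_ratios, 3 ) )
```

The interval for rs8176746_dosage spans 1, while the interval for o_bld_group lies entirely below 1.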
So what's going on?
One way to think about this is to ask what the baseline group in each regression is: the individuals for whom every predictor is at its reference level, so that the linear predictor is just the intercept. Setting country aside:
- For fit5, the baseline is everyone who has rs8176746 == G/G.
- For fit5b, however, the baseline is everyone who has rs8176746 == G/G and has non-O blood type.
Even though the two models look similar, they are measuring different things. In the first fit (fit5), the baseline group includes a bunch of people who have O blood type, but in the second fit it doesn't.
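(Individuals outside the G/G baseline mostly have non-O blood type, so if we believe O blood group is protective, this is another way of saying that O blood group is confounding the rs8176746 association in fit5.)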
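The same pattern can be reproduced on simulated data. The sketch below is an illustration only: the data are made up (sample size, allele frequencies and effect size are all assumptions), case status depends on O blood group alone, and the variable names sim, o, b_dosage, marginal and joint are hypothetical.

```r
set.seed( 1 )
n = 50000

# Hypothetical simulated data, not the tutorial dataset.
# o = 1 for O blood group; the B allele is made rare in O individuals.
o = rbinom( n, 1, 0.5 )
b_dosage = rbinom( n, 2, ifelse( o == 1, 0.03, 0.3 ) )

# Case status depends on O only: B has no effect of its own
status = rbinom( n, 1, plogis( 0.1 - 0.3 * o ) )
sim = data.frame( status, o, b_dosage )

# Model with B dosage alone, and model with both predictors
marginal = glm( status ~ b_dosage, data = sim, family = "binomial" )
joint    = glm( status ~ o + b_dosage, data = sim, family = "binomial" )

summary( marginal )$coeff
summary( joint )$coeff
```

In this simulation the B dosage picks up an apparent effect when fitted alone, purely because it is correlated with non-O status, and that effect shrinks towards zero once O is included.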
Encoding A/B/O directly
A better way to solve this problem is to encode the biologically relevant variable directly. The biology works as follows: each individual has two chromosomes, and each carries the determinant of either the A or B antigen. Each chromosome might also carry the loss-of-function 'O' deletion. Based on this we can call A/B/O blood type as follows:
rs8176719  rs8176746  blood group phenotype
---------  ---------  ---------------------
C/C        G/G        A
C/C        G/T        AB
C/C        T/T        B
-/C        G/G        A
-/C        G/T        B or A? (*)
-/C        T/T        B
-/-        G/G        O
-/-        G/T        O
-/-        T/T        O
The cell marked (*) is the only difficult one here: both variants are heterozygous and we don't know from this data how the alleles are arranged together on the two chromosomes. However, here we are helped by the fact that the O blood type mutation almost always occurs on the 'A' type background (i.e. chromosomes with the 'G' allele at rs8176746). You can see this by tabling the two variants:
table( data$rs8176719, data$rs8176746 )
       G/G  G/T  T/T
  -/- 4621  264    7
  -/C 2331 2244   60
  C/C  331  510  287
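To put numbers on this, the counts in the table above can be turned into row proportions. The sketch below re-enters those counts by hand so that it runs without the dataset; the names counts and row_props are ours.

```r
# Genotype counts copied from the table above
# (rows: rs8176719, columns: rs8176746)
counts = matrix(
  c( 4621,  264,   7,
     2331, 2244,  60,
      331,  510, 287 ),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    rs8176719 = c( "-/-", "-/C", "C/C" ),
    rs8176746 = c( "G/G", "G/T", "T/T" )
  )
)

# Proportion of each rs8176746 genotype within each rs8176719 genotype
row_props = prop.table( counts, margin = 1 )
print( round( row_props, 3 ) )
```

Roughly 94% of the -/- individuals are G/G at rs8176746.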
With a few exceptions, all type O individuals have G/G genotype at rs8176746; the heterozygous -/C individuals are also consistent with most O type haplotypes carrying the 'G' allele. In a doubly-heterozygous (-/C, G/T) individual the deletion most likely sits on the 'G' chromosome, leaving the 'T' allele on the intact one. For the sake of this tutorial we will therefore assume that these doubly-heterozygous individuals have B blood type. Let's encode this now:
data$abo_type = factor( NA, levels = c( "A", "B", "AB", "O" ))
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'G/G' ] = 'A'
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'G/T' ] = 'AB'
data$abo_type[ data$rs8176719 == 'C/C' & data$rs8176746 == 'T/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'G/G' ] = 'A'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'G/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/C' & data$rs8176746 == 'T/T' ] = 'B'
data$abo_type[ data$rs8176719 == '-/-' ] = 'O'
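Before fitting, it can be worth checking that the encoding matches the genotype table. The sketch below wraps the same rules in a small helper, call_abo() (a hypothetical name, not part of the tutorial code), and applies it to a toy data frame containing each of the nine genotype combinations once, so that it runs without the tutorial dataset.

```r
# Hypothetical helper: same rules as above, written as a function
call_abo = function( rs8176719, rs8176746 ) {
  abo = rep( NA_character_, length( rs8176719 ) )
  intact = rs8176719 %in% c( "C/C", "-/C" )   # at least one non-deleted copy
  abo[ intact & rs8176746 == "G/G" ] = "A"
  abo[ intact & rs8176746 == "T/T" ] = "B"
  abo[ rs8176719 == "C/C" & rs8176746 == "G/T" ] = "AB"
  # Assumption from the text: doubly-heterozygous individuals are called B
  abo[ rs8176719 == "-/C" & rs8176746 == "G/T" ] = "B"
  abo[ rs8176719 == "-/-" ] = "O"
  factor( abo, levels = c( "A", "B", "AB", "O" ) )
}

# Toy data: every combination of the two genotypes, once each
toy = expand.grid(
  rs8176719 = c( "-/-", "-/C", "C/C" ),
  rs8176746 = c( "G/G", "G/T", "T/T" ),
  stringsAsFactors = FALSE
)
toy$abo_type = call_abo( toy$rs8176719, toy$rs8176746 )

# Each genotype combination should land in exactly one blood group
table( paste( toy$rs8176719, toy$rs8176746 ), toy$abo_type, useNA = "ifany" )
```

On the real data, the same cross-tabulation of the two genotype columns against data$abo_type should show each genotype combination falling into a single blood group.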
...and fit it:
fit6 = glm(
  status ~ abo_type + country,
  data = data,
  family = "binomial" # needed to specify logistic regression
)
summary(fit6)$coeff
The baseline is now, of course, 'A' blood type individuals.
Is there any evidence that B or AB blood type is associated with a different risk of malaria, compared to A?
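One possible way to approach this question is a likelihood ratio test comparing the A/B/O model against a simpler model containing only O versus non-O status. The sketch below shows the mechanics on simulated data (hypothetical, with made-up blood group frequencies and an effect of O only), since the tutorial dataset is not loaded here. On the real data the analogous comparison would be between fit6 and a model with o_bld_group in place of abo_type, assuming o_bld_group marks exactly the individuals whose abo_type is 'O'.

```r
set.seed( 2 )
n = 20000

# Hypothetical simulated data: blood group frequencies are made up
abo_type = factor(
  sample(
    c( "A", "B", "AB", "O" ), n,
    replace = TRUE, prob = c( 0.25, 0.2, 0.05, 0.5 )
  ),
  levels = c( "A", "B", "AB", "O" )
)
is_o = as.integer( abo_type == "O" )

# Only O affects risk in this simulation
status = rbinom( n, 1, plogis( 0.1 - 0.3 * is_o ) )
sim = data.frame( status, abo_type, is_o )

reduced = glm( status ~ is_o, data = sim, family = "binomial" )
full    = glm( status ~ abo_type, data = sim, family = "binomial" )

# Likelihood ratio test: do B and AB differ from A, beyond the O effect?
lrt = anova( reduced, full, test = "Chisq" )
print( lrt )
```

The test has two degrees of freedom, one each for the B and AB coefficients that the reduced model forces to equal the A baseline.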