Fitting a logistic regression
Let's get started by fitting a logistic regression model. Logistic regression is used to test for association with binary outcome variables (such as disease status) - or more generally with categorical outcomes. It's one of the basic building blocks, for example, of many genome-wide association study analyses, so it is worth learning how to interpret it.
Getting the data
The data for this practical can be found at this link.
Right-click and press 'copy link'.
Now start a new R session and load the data in like this:
library(tidyverse)
data = read_tsv( <paste your link here> )
If the above link doesn't work for any reason, try the link on Canvas.
Or try this version. (To get the URL, click 'Raw' and then copy the URL.)
What's in the data?
The data contains genotype calls for the genetic variant rs8176719 (and rs8176746), measured in a set of severe malaria cases and population controls from several African countries. These variants lie in the ABO gene and together encode the most common ABO blood group types.
In particular, rs8176719 is the main variant that determines O blood group status. It is a one base-pair deletion that causes a frameshift and early termination of translation. If you wish, you can read more about ABO and the O blood group mutation here.
If you want to see what the effect of these variants is, you can also look them up in the UCSC genome browser and/or the Ensembl genome browser. For example, you should see that rs8176719 is a frameshift variant within the ABO gene. (Which exon is it in?) Chromosomes carrying the deletion allele (like the GRCh38 reference sequence) produce a truncated, non-functional enzyme with no glycosyltransferase activity, so no A or B antigens are synthesised.
(Of course, humans have two copies so if the sample is heterozygous then it's quite possible the other copy does express A or B.)
The data were collected by separate case-control studies operating in each country and were genotyped by MalariaGEN. They were described in this paper.
Before starting, explore the data (e.g. using View(), table() and so on) and make sure you know what is in there. You should see:
- Information on the sample identifier, sex, and ethnicity
- The status column, which indicates whether the sample was collected as a disease case (i.e. came into hospital with severe symptoms of malaria) or as a population control
- A column with the rs8176719 genotypes (and one with the rs8176746 genotypes)
- And some columns marked PC_1 - PC_5.
What countries are represented in the data?
In each country, how many samples are cases and how many controls?
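If you're not sure where to start on these, here is one minimal sketch using base R functions and the column names described above:
# List the countries in the data
unique( data$country )
# Cross-tabulate country against case/control status
table( data$country, data$status )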
Is O blood group associated with malaria susceptibility?
A and B antigens are bound to the surface of red cells and tend to stick to things - it's plausible that malaria parasites might exploit this. Let's therefore use logistic regression to test whether O blood group is associated with malaria status. Logistic regression is similar to fitting a linear regression (as we did earlier) but works for a binary outcome variable, such as the 'severe malaria' status in this data.
To get started let's create a new variable indicating 'O' blood group status:
# Set to missing first, in case any rs8176719 genotypes are missing
data$o_bld_group = NA
data$o_bld_group[ data$rs8176719 == '-/-' ] = 1
data$o_bld_group[ data$rs8176719 == '-/C' ] = 0
data$o_bld_group[ data$rs8176719 == 'C/C' ] = 0
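It's worth sanity-checking the recoding before moving on - for example by cross-tabulating the original genotypes against the new variable (a quick check, not part of the analysis itself):
# Each genotype should map to exactly one value of o_bld_group;
# missing genotypes should stay as NA
table( data$rs8176719, data$o_bld_group, useNA = "ifany" )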
What is the frequency of O blood group in this sample, in each population?
In case you are stuck, try the following hints and solutions.
Hint 1: one way to do this is using the table() function. You can then use the results to compute the frequencies.
Hint 2: another way is to build a pipeline using dplyr.
Table solution: for example, this builds a table of counts:
table( data$country, data$o_bld_group )
To compute frequencies, you would then need to capture the results, like this:
counts = table( data$country, data$o_bld_group )
frequencies = counts[,2] / (counts[,1] + counts[,2])
Dplyr hint: another, longer (but cooler) way is to build a 'pipeline' using dplyr. For example you can:
- use filter() to remove rows where the o_bld_group variable is missing
- use group_by() to group by country
- use summarise() to count the number of O blood group individuals, and the total, for example like this:
... %>% summarise( total = n(), o = sum( o_bld_group ))
- and use mutate() to add the frequency:
%>% mutate( o_frequency = o/total )
Can you do it?
Dplyr solution:
(
data
%>% filter( !is.na( o_bld_group ))
%>% group_by( country )
%>% summarise(
total = n(),
o = sum( o_bld_group )
)
%>% mutate( o_frequency = o / total )
)
# A tibble: 5 × 4
country total o o_frequency
<chr> <int> <dbl> <dbl>
1 Cameroon 1182 586 0.496
2 Gambia 5027 2152 0.428
3 Ghana 694 280 0.403
4 Kenya 3054 1546 0.506
5 Tanzania 769 350 0.455
However you do this you should see that blood group O is at about 40-50% frequency in these African populations.
Running the regression
For best results, let's also make sure our status variable is coded the right way round (with "CONTROL" as the baseline):
data$status = factor( data$status, levels = c( "CONTROL", "CASE" ))
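You can check that the releveling worked - the first level printed should be "CONTROL", i.e. the baseline:
levels( data$status )
# [1] "CONTROL" "CASE"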
Fitting a logistic regression model is now very easy in R - we use the glm() function:
fit = glm(
status ~ o_bld_group,
data = data,
family = "binomial" # needed to specify logistic regression
)
This command says: fit a logistic regression model of severe malaria status modelled in terms of O blood group status, using the given dataframe. The 'family' argument is there to tell glm that the outcome variable is a binary (0/1) indicator, as glm can fit more general models as well.
Note. You can compare this to how we fit a linear model previously.
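(For comparison, here is roughly what the analogous lm() call looks like - using made-up example data, since the chromosome dataset from that tutorial isn't loaded here. Note that lm() needs no 'family' argument because the outcome is continuous.)
# Illustrative only: a linear model fit with lm(), using simulated data
example = tibble( x = rnorm( 100 ), y = 2 * x + rnorm( 100 ) )
linear_fit = lm( y ~ x, data = example )
summary( linear_fit )$coefficients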
Congratulations! You've just fit a logistic regression model.
Interpreting the regression output
The best way to inspect the results of glm() is using the summary() command to get the fitted coefficients:
coefficients = summary(fit)$coefficients
print( coefficients )
You should see something like this:
> print( coefficients )
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.1489326 0.02630689 5.661354 1.50183e-08
o_bld_group -0.3342393 0.03889801 -8.592709 8.49425e-18
Let's take a moment to parse this result to see what it says.
The parameter estimate
First, in the 'Estimate' column you see the fitted values of the parameters. There are, in fact, two parameters - one for something called the 'intercept', which we'll call $\mu$, and one for O blood group itself, which we'll call $\beta$.
The parameters are in fact expressed on the log odds scale. Specifically:
$\mu$ is the estimated baseline log-odds of disease. (This is the log-odds of disease if you don't carry blood group O.)
$\beta$ is the estimated change in log-odds associated with blood group O versus non-O blood group.
The modelled log-odds of being a case is therefore $\mu$ (for non-O blood group samples) and $\mu + \beta$ (for O blood group samples).
(Note: the log-odds is just a transform of the probability through the logit function:
$$\text{logit}(p) = \log\left( \frac{p}{1-p} \right)$$
It maps probabilities (x axis) onto the whole real line (y axis), with 50% probability being mapped to a log-odds of zero.) Smaller log-odds correspond to smaller probabilities and vice versa.
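In R, a minimal version of the logit function could be written like this (the name 'logit' is our choice here - it isn't a base R function, though base R's qlogis() computes the same thing):
# logit: transform a probability into a log-odds
logit <- function(p) {
	log( p / (1 - p) )
}
logit( 0.5 )  # = 0
logit( 0.9 )  # positive, about 2.2
logit( 0.1 )  # negative, about -2.2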
Which of these statements do you agree with:
- O blood group is associated with lower chance of having severe malaria, relative to A/B blood group.
- O blood group is associated with higher chance of having severe malaria, relative to A/B blood group.
- O blood group is protective against severe malaria.
- We can't tell if O blood group is protective or confers risk against severe malaria.
Logistic vs. linear regression
In the previous tutorial we fit a linear regression model to a continuous outcome variable and a predictor variable - namely $Y$ = % GC content and $X$ = chromosome length. The plot looked like this: [scatter plot of % GC content against chromosome length]
The linear regression model was:
$$Y \sim N( \mu + \beta X, \sigma^2 )$$
In other words, we modelled $Y$ as normally distributed, but with a mean that depended on $X$.
The parameter estimates were interpreted as:
- $\mu$ was the estimated baseline value of $Y$, also known as the intercept.
- $\beta$ was the estimated change in $Y$ for a unit increase in $X$.
That is, we estimated that GC content drops by about 0.024% for each megabase increase in chromosome length.
In logistic regression, just like linear regression, we again model the probability that $Y$ takes particular values based on the linear predictor $\mu + \beta X$. There are two differences. First, instead of using a normal distribution, we use this to directly model the probability that $Y = 1$. And secondly, we work on the log-odds scale instead.
The full formula for logistic regression is therefore:
$$\text{logit}\big( P( Y = 1 ) \big) = \mu + \beta X$$
This is why the parameter estimates are interpreted on the log-odds scale, as described above, instead of the linear scale.
If you know the O blood group status of a sample in the study, you can use this to work out the modelled probability that they are a severe malaria case.
The function that transforms back from log-odds space to probability space is called the logistic function - it looks like this:
$$\text{logistic}(x) = \frac{e^x}{1 + e^x}$$
or in R:
logistic <- function(x) {
exp(x) / (1+exp(x))
}
Can you plot the logistic function now over a range around zero? It should look like an S-shaped (sigmoid) curve.
Note how it maps the whole real line (x axis, log-odds space) back to the unit interval (y axis, probability space).
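If you're not sure how to make this plot, here is one possible sketch using the logistic() function defined above:
# Evaluate logistic() on a grid of log-odds values and plot the curve
x = seq( -6, 6, by = 0.01 )
curve_data = tibble( log_odds = x, probability = logistic( x ))
(
	ggplot( data = curve_data )
	+ geom_line( aes( x = log_odds, y = probability ))
	+ geom_hline( yintercept = 0.5, linetype = 2 )
	+ theme_minimal( 16 )
)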
Use this function to answer the following questions:
In our fitted model, what is the estimated probability that a non-O blood group individual has severe malaria in this sample? (Hint: apply the logistic function to the baseline parameter estimate $\hat{\mu}$.)
What is the estimated probability that an O blood group individual has severe malaria in this sample? (Hint: apply the logistic function to the quantity $\hat{\mu} + \hat{\beta}$.)
Caution. The probabilities in this model can not be interpreted as absolute population risks. This is because the sample is a case-control sample in which cases are over-sampled.
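To compute these, you can plug the fitted coefficients into logistic(). For example, using the estimates printed above:
mu = 0.1489326      # intercept: baseline log-odds (non-O individuals)
beta = -0.3342393   # estimated effect of O blood group
logistic( mu )           # modelled probability for non-O individuals (about 0.54)
logistic( mu + beta )    # modelled probability for O individuals (about 0.45)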
Hopefully you can see how the logistic model is similar to linear regression (it has the same linear predictor), but also different in how the outcome is modelled.
Logistic regression vs a 2x2 table
In fact, this parameter estimate is just another way of expressing what you can compute from a 2x2 table. To see this, try making a table of O blood group status against outcomes:
table( data$status, data$o_bld_group )
0 1
CONTROL 2690 2684
CASE 3122 2230
Let's think of this matrix as
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
and compute the sample odds ratio:
$$\text{OR} = \frac{a \times d}{b \times c}$$
If you work this out for the above table you get:
odds_ratio = (2690*2230) / (3122*2684)
Compute this sample odds ratio now, and compare it to the exponential of the parameter estimate:
exp( -0.3342393 )
What do you notice?
A key fact is that for an unadjusted regression like this (with no other variables in), the parameter estimate from logistic regression is just the logarithm of the odds ratio computed from a 2x2 table.
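You can check this equivalence directly in R - the log of the table-based odds ratio should match the regression estimate up to rounding (the indexing below assumes the table is laid out as shown above, with controls in the first row and non-O in the first column):
counts = table( data$status, data$o_bld_group )
odds_ratio = ( counts[1,1] * counts[2,2] ) / ( counts[1,2] * counts[2,1] )
log( odds_ratio )                                        # about -0.334
summary(fit)$coefficients[ 'o_bld_group', 'Estimate' ]   # the same value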
To fully interpret this estimate, there's a final piece of the puzzle.
In this study, disease cases have been over-sampled (about half of the study samples are severe malaria cases, even though severe malaria is much rarer in the population).
So to fully interpret it, imagine what is going on in the whole population, rather than in the study. Let's suppose that disease cases arise at some frequency $f_{\text{non-O}}$ (for individuals with non-O blood group) and $f_{\text{O}}$ (for individuals with O blood group). The simplest way to measure association in the population is using the relative risk:
$$\text{RR} = \frac{f_{\text{O}}}{f_{\text{non-O}}}$$
Key fact. In a case-control study, if the controls are 'unselected' (i.e. just taken as members of the general population), the study odds ratio is an estimate of the population relative risk. The reason is that the above can be re-written using Bayes' rule in terms of blood group frequencies in cases and in population controls:
$$\text{RR} = \frac{P(\text{case} \mid \text{O})}{P(\text{case} \mid \text{non-O})} = \frac{P(\text{O} \mid \text{case}) \, P(\text{non-O})}{P(\text{non-O} \mid \text{case}) \, P(\text{O})}$$
The four terms there can be estimated by taking a sample of cases and population controls, and measuring their genotypes.
Thus, even though the baseline parameter doesn't reflect the population risk of disease, the odds ratio reflects the relative risk in the population.
In our case the relative risk is about 0.7 - so we could say O blood group is associated with about a 30% reduction in the risk of severe malaria.
Plotting 95% intervals
Look back at the regression output now:
> summary(fit)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.1489326 0.02630689 5.661354 1.50183e-08
o_bld_group -0.3342393 0.03889801 -8.592709 8.49425e-18
Let's use it to plot a 95% confidence or credible interval using ggplot. For this we need a data frame:
plot_data = tibble(
population = "all",
beta_hat = -0.3342393,
se = 0.03889801
)
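(Rather than typing the numbers in by hand, you could also pull them straight out of the fitted model object - a small sketch using the coefficients table from above:)
coefficients = summary(fit)$coefficients
plot_data = tibble(
	population = "all",
	beta_hat = coefficients[ 'o_bld_group', 'Estimate' ],
	se = coefficients[ 'o_bld_group', 'Std. Error' ]
)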
As usual it's useful to write a re-usable function here. We'll make it add the confidence interval bounds and then plot. (By now you should know the magic number: a 95% confidence interval goes out 1.96 standard errors from the estimate.)
forest_plot <- function(
forest_plot_data,
xlim = c( -0.6, 0.1 )
) {
forest_plot_data$lower = forest_plot_data$beta_hat - 1.96 * forest_plot_data$se
forest_plot_data$upper = forest_plot_data$beta_hat + 1.96 * forest_plot_data$se
(
ggplot(
data = forest_plot_data
)
+ geom_segment(
mapping = aes(
x = lower,
xend = upper,
y = population,
yend = population
)
)
+ geom_point(
aes( x = beta_hat, y = population ),
size = 4
)
+ xlab( "Parameter estimate and confidence interval")
+ xlim( xlim )
+ geom_vline( xintercept = 0, linetype = 2 )
+ theme_minimal( 16 )
)
}
forest_plot( plot_data )
This plot is on the log-odds scale. What does it look like on the odds scale? (Hint: try the plot after exponentiating the estimate, lower and upper. Don't forget to change the x axis label! You will of course have to change the x axis limits - something like 0.5 - 1 should work.)
Note 2. For extra kudos, transform to the odds scale as above, but then add scale_x_log10() to draw on a log scale again.
Here is one possible solution:
forest_plot <- function(
forest_plot_data,
xlim = c( 0.5, 1.1 )
) {
forest_plot_data$lower = forest_plot_data$beta_hat - 1.96 * forest_plot_data$se
forest_plot_data$upper = forest_plot_data$beta_hat + 1.96 * forest_plot_data$se
(
ggplot(
data = forest_plot_data
)
+ geom_segment(
mapping = aes(
x = exp(lower),
xend = exp(upper),
y = population,
yend = population
)
)
+ geom_point(
aes( x = exp(beta_hat), y = population ),
size = 4
)
+ xlab( "Odds ratio and 95% confidence interval")
+ scale_x_log10( limits = xlim )
+ geom_vline( xintercept = 1, linetype = 2, lwd = 2 )
+ theme_minimal( 16 )
)
}
forest_plot( plot_data )
In this module we are using the terms 'confidence interval' and 'credible interval' essentially interchangeably.
The term 'credible interval' is associated with the Bayesian interpretation of this process, in which the estimate and interval provide a summary of our belief about the parameter, having seen the data. (That is, a summary of the posterior distribution, in Bayesian-speak.) About 95% of the mass of the posterior is in this interval. This is the way we've presented it in lectures.
The formal definition of 'confidence interval' is more complicated (and probably not particularly helpful) - to understand it you have to imagine conceptually replicating this experiment many times over and counting the proportion of times that this process of going out 1.96 standard errors captures the 'true' parameter value in the interval - it should be about 95%.
The p-value and z score
Finally let's interpret the last two additional columns - the z score and the p-value. These in fact add no new information:
- The z-score is just the effect estimate divided by the standard error (you can check this for yourself). It says "how many standard errors away from zero is the estimate"?
- The p-value is another measure of the same thing. It is the z-score transformed through the normal cumulative distribution function (and multiplied by two, to capture both tails of the distribution):
$$p = 2 \times \Phi( -|z| )$$
where $\Phi$ is the standard normal cumulative distribution function.
You can confirm this for yourself in R by computing it directly:
# compute the p-value by hand:
pnorm( -0.3342393 / 0.03889801 ) * 2
You should see a value identical to the one in the regression output.
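And you can check the z value the same way - it is just the estimate divided by its standard error:
# compute the z-score by hand:
-0.3342393 / 0.03889801   # about -8.59, matching the 'z value' column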
The P-value is usually interpreted as saying: how likely would we be to see an estimate as strong as $\hat{\beta}$ "by chance" - i.e. purely due to sampling noise - if there were no true association?
In our case the P-value is about $8.5 \times 10^{-18}$, so this association is very unlikely to be due purely to chance.
Looking across populations
If O blood group is really protective, we'd expect the effect to be reasonably consistent across different subsets of our data - say across countries.
Let's assess that now by fitting the model separately in each country, using a loop:
forest_plot_data = tibble()
populations = unique( data$country )
for( population in populations ) {
fit = glm(
status ~ o_bld_group,
family = "binomial",
data = data %>% filter( country == population )
)
coeffs = summary(fit)$coefficients
forest_plot_data = bind_rows(
forest_plot_data,
tibble(
population = population,
beta_hat = coeffs['o_bld_group',1],
se = coeffs['o_bld_group',2],
p = coeffs['o_bld_group',4]
)
)
}
forest_plot( forest_plot_data )
How do you interpret this plot?