Fitting a logistic regression

Let's get started by fitting a logistic regression model. Logistic regression is used to test for association with binary outcome variables (such as disease status) - or more generally with categorical outcomes. It's one of the basic building blocks, for example, of many genome-wide association study analyses, so it is worth learning how to interpret it.

Getting the data

Start a new R session and load the data in like this:

library(tidyverse)
data = read_tsv( "https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2024/data/o_bld_group_data.tsv" )

Note

If the above link doesn't work for any reason, try the this version. (To load this, click 'Raw', copy the URL, and read into R as above.)

What's in the data?

The data contains genotype calls for the genetic variant rs8176719 (and rs8176746), measured in a set of severe malaria cases and population controls from several African countries. These variants lie in the ABO gene and together encode the most common ABO blood group types.

In particular, rs8176719 is the main variant that determines O blood group status. It represents a one base-pair deletion that causes a frameshift and early termination of transcription. If you wish, you can read more about ABO and the O blood group mutation here.

Note

If you want to see what the genetic effect of these variants is, you can also look it up in the UCSC genome browser and/or the Ensembl genome browser. For example you should see that rs8176719 is a frameshift variant within ABO gene. (Which exon is it in?). Chromosomes with the deletion allele (like the GRCh38 reference sequence) produce a malformed enzyme that prevents synthesis of A or B antigens (they cause a loss of glycosyltransferase activity).

(Of course, humans have two copies so if the sample is heterozygous then it's quite possible the other copy does express A or B.)

The data were collected by seperate case-control studies operating in each country and were genotyped by MalariaGEN. They were described in this paper.

Before starting, explore the data (e.g. using View(), table() and so on) and make sure you know what is in there. You should see:

Information on the sample identifier and sex, and ethnicity
The status column, which indicates if the sample was collected as a disease case (i.e. came into hospital with severe symptoms of malaria) or a population control.
A column with the rs8176719 genotypes (and one with the rs8176746 genotypes.)
And some columns marked PC_1 - PC_5.

We'll focus on the status and genotype columns for now.

Warm-up questions

Use table() or the data verbs to answer:

What countries are represented in the data, and what are the sample sizes?

In each country, how many samples are cases and how many controls?

Is O blood group associated with malaria susceptibility?

A and B antigens are bound to the surface of red cells and tend to stick to things - it's plausible malaria parasites might exploit this. Let's therefore use logistic regression to test to see if O blood group is associated with malaria status. Logistic regression is similar to fitting a linear regression (as we did earlier) but works for a binary outcome variable, such as the 'severe malaria' status in this data.

To get started let's create a new variable indicating 'O' blood group status:

# Set to missing first, in case any rs8176719 genotypes are missing
data$o_bld_group = NA           
data$o_bld_group[ data$rs8176719 == '-/-' ] = 1
data$o_bld_group[ data$rs8176719 == '-/C' ] = 0
data$o_bld_group[ data$rs8176719 == 'C/C' ] = 0

Question

What is the frequency of O blood group in this sample, and in each population?

Try do do this yourself. If you get stuck, look at the tabs above for hitnts.

You can get counts of genotypes either using the table() function, or the data verbs.

You then need to use these to compute allele frequencies.

For example, this builds a table of counts:

table( data$country, data$o_bld_group )

To compute frequencies, you would then need to capture the results to compute frequencies, like this:

counts = table( data$country, data$o_bld_group )
frequencies = counts[,2] / (counts[,1] + counts[,2])

A probably nicer way to do this is via the 'data verbs'. For example you can:

use filter() to remove rows where the o_bld_group variable is missing (watch out - there are a few!)
use group_by() to group by country
use summarise() to count the number of o blood group individuals, and the total, for example like this:

... %>% summarise( total = n(), o = sum( o_bld_group ))

and use mutate() to add the frequency,

%>% mutate( o_frequency = o/total )

Can you do it?

(
    data
    %>% filter( !is.na( o_bld_group ))
    %>% group_by( country )
    %>% summarise(
        total = n(),
        o = sum( o_bld_group )
    )
    %>% mutate( o_frequency = o / total )
)

# A tibble: 5 × 4
  country  total     o o_frequency
  <chr>    <int> <dbl>       <dbl>
1 Cameroon  1182   586       0.496
2 Gambia    5027  2152       0.428
3 Ghana      694   280       0.403
4 Kenya     3054  1546       0.506
5 Tanzania   769   350       0.455

However you do this you should see that blood group O is at about 40-50% frequency in these African populations.

Running the regression

For best results let's also make sure our status variable is coded the right way round (with "CONTROL" as the baseline):

data$status = factor( data$status, levels = c( "CONTROL", "CASE" ))

Fitting a logistic regression model is now easy in R.

instead of using lm(), which we used to fit linear regression, we will use the function glm() function instead:

fit = glm(
    status ~ o_bld_group,
    data = data,
    family = "binomial"  # needed to specify logistic regression
)

Note

Compare this with an equivalent command to fit linear regression. In the glm() command, we also have to specify the 'family', which tells R that we are fitting a logistic regression.

Specifically this command says: fit a logistic regression model of severe malaria status modelled in terms of O blood group status, using the given dataframe. The 'family' argument tells glm that the outcome variable is a binary (0/1) indicator, and that this should be modelled using a binomial distribution.

Congratulations! You've just fit a logistic regression model.

Interpreting the regression output

Please note!

The following part of this practical have been updated compared to what we ran in class - in the hope of clarifying things a bit.

You can also compare this to the updated Part I slides 43-49 on canvas for more information.

As with lm(), the best way to inspect the results of glm() is using the summary() command to get the fitted coefficients:

coefficients = summary(fit)$coefficients
print( coefficients )

You should see something like this:

> print( coefficients )
              Estimate Std. Error   z value    Pr(>|z|)
(Intercept)  0.1489326 0.02630689  5.661354 1.50183e-08
o_bld_group -0.3342393 0.03889801 -8.592709 8.49425e-18

Let's take a moment to parse this result to see what it says.

Interpreting on the log-odds scale

In many ways this is just like linear regression. You get back parameter estimates, standard errors, and P-values.

In the 'Estimate' column you see the fitted values of the parameters. There are two parameters - one for the 'intercept', which we'll call $\hat{\mu}$ , and one for O blood group itself, which we'll call $\hat{\beta}$ .

The parameters are in fact expressed on the log odds scale. Specifically:

$\hat{\mu} = 0.1489326$ is the estimated baseline log-odds of disease. (This is the log-odds of disease if you don't carry blood group O).
$\hat{\beta} = -0.3342393$ is the estimated change in log-odds, or in other words the log odds ratio, associated with blood group O versus non-O blood group.

Interpretation on the odds scale

However as described in the lecture it is also useful to interpret these on the odds (and odds ratio) scale, not just on the log-odds scale.

Three scales

Remember there are three scales you could think about in logistic regression:

The log-odds scale, where the parameter estimates live.
The odds scale, where the odds ratio estimate lives, and
The probability scale

Each one is a function of the other. To get between them you do this:

To get odds $o$ from log-odds $x$ , set $o = e^x$
To get an odds ratio from log-odds ratio $\beta$ , compute $\text{odds ratio} = e^\beta$ .
To get probability $p$ from odds $o$ , set $p = \frac{o}{1+o}$ .
Or to get probability $p$ directly from log-odds $x$ , set $p = \frac{e^x}{1+e^x}$ .

This last function is known as the 'logistic' function and you could write it like this in R:

logistic <- function(x) { exp(x) / (1 + exp(x))}

It looks like this:

As you can see, logistic() maps log-odds space $[-\infty,\infty]$ back onto probability space $[0,1]$ . It is the inverse of the 'logodds' function shown on the lecture slides:

As you can see from the graphs, a log-odds of zero corresponds to 50% probability; and larger probabilities correspond to larger log-odds.

Questions

Look at the estimates again:

> print( coefficients )
              Estimate Std. Error   z value    Pr(>|z|)
(Intercept)  0.1489326 0.02630689  5.661354 1.50183e-08
o_bld_group -0.3342393 0.03889801 -8.592709 8.49425e-18

Now try to answer the following questions about non-O blood group individuals:

What is the modelled log-odds that a non-O blood group individual is a case?
What is the modelled odds that a non-O blood group individual is a case?
And if you like, what is the modelled probability that a non-O blood group individual is a case?

And the same questions for O blood group individuals:

What is the modelled log-odds that an O blood group individual is a case?
What is the modelled odds that a O blood group individual is a case?
...and if you like, what is the modelled probability that a O blood group individual is a case?

And the most important question:

What is the estimated odds ratio?
What is the estimated relative risk of severe malaria associated with O blood group?

Do you think O blood group is associated with higher, or lower risk of severe malaria?

Hints
Hint 1
Hint 2
Hint 3
Hint 4

See if you can do this yourself. If not, use the tabs to get some hints.

You should by now have seen that, whichever scale you look on, O blood group is associated with a reduced risk of severe malaria! It has an odds ratio of about $0.72$ , meaning that it's associated with a 28% reduction in risk.

Computing a confidence/credible interval

The rule to compute a confidence / credible interval is:

Rule to compute a 95% confidence/credible interval

First compute the confidence/credible interval for $\hat{\beta}$ on the log-odds ratio scale. Then use exp() to convert back to the odds ratio / relative risk scale.

Question

Compute a 95% confidence interval for the odds ratio now.

Hints
Solution

See if you can do this yourself. If you get stuck, see the answer on the tab above.

beta = -0.3342393
se = 0.03889801
lower = beta - 1.96 * se
upper = beta + 1.96 * se

sprintf( "   The odds ratio is: %.3f.\n", exp( beta ))
sprintf( "The 95pc interval is: %.3f - %.3f.\n", exp(lower), exp(upper) )

What is your overall interpretation of these regression results?

The forest plot

Ok that's not good enough! Let's make a forest plot of the association evidence across populations. This requires a bit of code which we can implement as follows.

First, rather than typing out our model fit loads of times, let's make a function that fits the regression for us. I'll help myself out by turning the fit parameters into a proper data frame before returning:

fit_regression <- function(
    data,
    formula = status ~ o_bld_group
) {
    library( dplyr )
    fit = glm(
        formula,
        data = data,
        family = "binomial"
    )
    parameters = summary(fit)$coeff             # up to here is the same as before
    return(
        tibble::as_tibble(parameters)           # Turn it into a data frame
        %>% mutate(
            parameter = rownames( parameters )  # Make parameter names into an actual column
        )
        %>% select(                             # Rename some columns
            parameter,
            estimate = Estimate,
            se = `Std. Error`,
            z = `z value`,
            pvalue = `Pr(>|z|)`
        )
    )
}

If you run this on the original data frame, you should see what it does - gives the same estimates again but with slightly different formatting.

Now let's use some dplyr magic to run seperately in all populations:

estimates = (
    data
    %>% group_by( country )
    %>% reframe(
        fit_regression( pick( everything()) )
    )
)

Note

You don't have to worry about the details of this code - it is just a way to run our function across each country in the data seperately.

If you want to know how it works, the pick( everything()) bit just says to pass in the whole of the 'current data' into our function. (Because this is after a group_by(), that means the data for the current group.)

The reframe() function in there is similar to the group by / summarise examples from the data verbs session, but unlike summarise() allows us to return multiple rows per group. It's handy for things like this but you don't need to know it for this module.

If you look at the output you should now see two rows per country. One row contains the details for the intercept, and the other the details for the log odds ratio ( $\beta$ ) itself.

Now let's make a forest plot for the effect size parameter:

library( ggplot2 )
p = (
    ggplot(
        data = estimates %>% filter( parameter == 'o_bld_group' )
    )
    + geom_segment(
        aes(
            x = exp( estimate - 1.96 * se ),
            xend = exp( estimate + 1.96 * se ),
            y = country,
            yend = country
        ),
        colour = 'grey',
        linewidth = 1.5 # make it bigger & bolder!
    )
    + geom_point(
        aes(
            x = exp( estimate ),
            y = country,
        ),
        size = 4 # make the points bigger!
    )
    + xlim( c( 0.4, 1.2 ))
    + xlab( "Odds ratio and 95% CI" )
    + ylab( "Country" )

    # Here's some stuff to make it prettier
    + theme_minimal(16)
    + theme( axis.title.y = element_text( angle = 0, vjust = 0.5 ))
)
print(p)

Nice? But not good enough, let's do two things to make it better!

First, let's add a vertical dashed line at $1$ (which corresponds to 'no association'):

(
    p
    + geom_vline( xintercept = 1, linewidth = 1.5, linetype = 2 )
)

The second thing is a bit more subtle. We would like the plot to be on the natural (odds ratio or relative risk) scale, as it is at the moment, because that's easiest to interpret.

But on the other hand the odds ratio scale is a bit weird for a plot! (It goes from $0$ to $\infty$ , and is multiplicative i.e. with $1$ corresponding to 'no association'.) So in that sense it seems better to have this plot on the log scale.

Luckily, we can have our cake and eat it by simply telling ggplot to plot on a log scale:

(
    p
    + geom_vline( xintercept = 1, linewidth = 1.5, linetype = 2 )
    + scale_x_log10( limits = c( 0.4, 1.2 ))
)

Nice!

Question

What is your overall interpretation of these data?

Is O blood group likely associated with severe malaria across all these populations?

Congratulations - you've finished this part of the practical. Feel free to take a break for a few minutes, or read more information below if you're interested.

Other topics

Logistic vs. linear regression

Linear regression recap

In the previous tutorial we fit a linear regression model to a continuous outcome variable $Y$ and a predictor variable $X$ - namely $Y$ = % GC content, and $X$ = chromosome length. The plot looked like this:

The linear regression model was:

Y = \mu + \beta X + \text{gaussian noise}

In other words - we modelled $Y$ has normally distributed, but with a mean that depended on $X$ .

The parameter estimates were interpreted as:

$\hat{\mu} = 44.3$ was the estimated baseline value of $Y$ , also known as the intercept.
$\hat{\beta} = -0.024$ was the estimated change in $Y$ for a unit increase in $X$ .

That is, we estimated that GC content drops by about 0.024% for each megabase increase in chromosome length.

The difference in logistic regression is that the linear predictor has to be interpreted on the log-odds scale - it is not on the scale of the outcome like it is in linear regression. To link to the outcome scale (e.g. disease statys), transform through the logistic function to get the modelled probability of the outcome levels.

Hopefully you can see how the logistic model is similar to linear regression (it has the same linear predictor), but also different in how the outcome is modelled.

Logistic regression vs a 2x2 table

In fact this parameter estimate is just another way to do what you can do in a 2x2 table example. To see this, try making a table of the genotypes and outcomes:

table( data$status, data$o_bld_group )

           0    1
CONTROL 2690 2684
CASE    3122 2230

Let's think of this matrix as $\left(\begin{matrix} a & b \\ c & d\end{matrix}\right)$ and compute the sample odds ratio:

\text{odds ratio} = \frac{a/c}{b/d} = \frac{ad}{bc}

If you work this out for the above table you get:

odds_ratio = (2690*2230) / (3122*2684)

Question

Compute this sample odds ratio now, and compare it to the exponential of the parameter estimate:

exp( -0.3342393 )

What do you notice?

A key fact is that for an unadjusted regression like this (with no other variables in), the parameter estimate from logistic regression is just the logarithm of the odds ratio computed from a 2x2 table.

Logistic regression and relative risk

In the above you have hopefully identified the odds ratio as an estimate of the population relative risk. That's why we can interpret it as 'O blood group is associated with about a 30% reduction in risk of severe malaria'.

We discussed this in the lectures but here is a recap.

Relative risk

Suppose disease cases in the population come along at some frequency $f_0$ (for individuals with non-O blood group) and $f_1$ (for individuals with O blood group). The simplest way to measure association in the population is using the relative risk:

\text{relative risk} = \frac{P(\text{disease}|\text{O bld grp})}{P(\text{disease}|\text{non-O bld grp})} = \frac{f_1}{f_0}

Key fact. The key fact discussed in lectures, is that in a case-control study if the controls are 'population controls' (i.e. sampled from members of the general population), the study odds ratio is an estimate of the population relative risk. The reason is that the above can be re-written using Bayes rule in terms of genotype frequencies in cases and population controls:

\frac{P(\text{disease}|\text{O bld grp})}{P(\text{disease}|\text{non-O bld grp})} = \frac{P(\text{O bld grp}|\text{disease})}{P(\text{O bld grp}|\text{population})} / \frac{P(\text{non-O bld grp}|\text{disease})}{P(\text{non-O bld grp}|\text{population})}

The four terms there are each estimated by can be estimated by taking a sample of cases and population controls, and measuring their genotypes.

How does this correspond to the logistic regression model? It turns out the logistic regression model we fit above is equivalent to this one in a 2x2 table:

\left( \begin{matrix} & \text{non-O} & \text{O} \\ \text{controls} & 1 - e^\mu & 1 - e^{\mu+\beta} \\ \text{cases} & e^\mu & e^{\mu+\beta} \end{matrix} \right)

If you 'do the maths' on this table (work out the odds ratio) you'll see that the odds ratio is juse $e^{\beta}$ .

Thus, even though the baseline parameter $\mu$ doesn't reflect the population risk of disease (because of the ), the odds ratio $e^\beta$ reflects the relative risk in the population.

In our case if the relative risk is about 0.7 - we could say O blood group is associated with about 30% reduction in risk of severe malaria.

Fitting a logistic regression

Getting the data​

What's in the data?​

Is O blood group associated with malaria susceptibility?​

Running the regression​

Interpreting the regression output​

Interpreting on the log-odds scale​

Interpretation on the odds scale​

Computing a confidence/credible interval​

The forest plot​

Other topics​

Logistic vs. linear regression​

Logistic regression vs a 2x2 table​

Logistic regression and relative risk​

Getting the data

What's in the data?

Is O blood group associated with malaria susceptibility?

Running the regression

Interpreting the regression output

Interpreting on the log-odds scale

Interpretation on the odds scale

Computing a confidence/credible interval

The forest plot

Other topics

Logistic vs. linear regression

Logistic regression vs a 2x2 table

Logistic regression and relative risk