
Part 1: Logistic regression warmup

In the skills training, we imagined the following scenario:

A drug company has tested a new treatment for a chronic condition.  They recruit some people and treat them
either with a placebo or with their new treatments. They then follow up after a year to see who has recovered. The data look like this:
data = tibble::tribble(
    ~group,        ~disease, ~recovered,
    'placebo',     93,       7,
    'treatment A', 78,       12,
    'treatment X', 5,        1
)

Should they be excited about their treatment?

Recap: simple approach from skills training

In the skills training we analysed this in a simple way:

  • We assumed each row of the table was drawn from a binomial distribution with recovery rate p:

P(\text{recovered}) = {N \choose x} p^x (1-p)^{N-x}

(if there were x recovered out of a total N in the row.)

  • To get a simple P-value we assumed the 'background' rate of recovery was exactly 7% - using the point estimate from the placebo row.
Recap

So for example for Treatment A we compute a P-value as:

binomial_p_value = pbinom( 11, size = 90, prob = 0.07, lower.tail = FALSE )

which gave the tail probability under the binomial distribution - that is, the P-value - reflecting the question: "How unlikely was it to see at least 12 recovered, if we sampled 90 people and the true rate of recovery was 7%?".
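The tail probability for 'at least 12 recovered' can be cross-checked by summing the binomial probability mass directly; a minimal sketch (just a check, not part of the analysis):

```r
# P( X >= 12 ) for X ~ Binomial( 90, 0.07 ), computed two ways:
sum( dbinom( 12:90, size = 90, prob = 0.07 ) )            # explicit sum over the tail
pbinom( 11, size = 90, prob = 0.07, lower.tail = FALSE )  # complementary CDF
# both lines give the same tail probability
```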

Using logistic regression to improve this analysis

The simple analysis above has two major flaws.

  1. We didn't take into account uncertainty in the estimate from placebo. Yes, the point estimate was 7%, but what if the actual rate of recovery was 10%? Or 5%?

  2. There was no way to build in any other variables that might affect the results. For example, what if the individuals had different ages, sexes, ethnicities or environments? Could that affect results?

Addressing these issues in this table example leads us to logistic regression.

To try out a logistic regression, let's load the full study data now - we'll focus on Treatment A:

drug_trial_data = readr::read_tsv( "https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2025/data/drug_trial_example.tsv" )
Note

Do your usual due diligence on this data frame. What's in the data?

In particular, check that if you table the treatment and outcome columns you get the same numbers as above. But this is the full data from the study - some other variables were collected as well.

Fitting logistic regression

Fitting a logistic regression is easy using the glm() function in R.

Logistic regression requires an outcome variable encoded as 0 or 1. Let's create that now. We'll use the data verbs way of doing this and use mutate:

library( dplyr )
drug_trial_data = (
    drug_trial_data
    %>% mutate(
        outcome_01 = c( "disease" = 0, "recovered" = 1 )[outcome]
    )
)
Note

If you don't like the above code, you could also do this using simpler R commands, like this:

drug_trial_data$outcome_01 = NA
drug_trial_data$outcome_01[ drug_trial_data$outcome == 'disease' ] = 0
drug_trial_data$outcome_01[ drug_trial_data$outcome == 'recovered' ] = 1
Note

Note that we have encoded this so 'recovered' gets level 1. This is important when we interpret results below!

Now to fit the logistic regression model, use glm:

g = glm(
outcome_01 ~ treatment, # The model 'formula'
family = "binomial", # The modelling distribution - here binomial because it's a 0/1 outcome
data = drug_trial_data # The data frame to fit the model in
)

You can print out g (try it!) but the simplest (and best) way to see the results is to look at the estimated coefficients:

summary(g)$coeff

You should see something like this:

> summary(g)$coefficients
                       Estimate Std. Error   z value     Pr(>|z|)
(Intercept)          -2.5866893  0.3919302 -6.599873 4.115106e-11
treatmenttreatment_A  0.7148872  0.4997631  1.430452 1.525873e-01

Congratulations! You have now run a logistic regression!

Interpreting the output

The parameter estimates

Focus on the 'estimate' column first. Remember we are working on the log-odds scale. This means the estimates can be interpreted as follows:

  • The 'intercept' estimate is -2.587. This is the estimated log-odds of recovery for placebo-treated participants, i.e. those not having treatment A.
Terminology: the baseline log-odds

This 'intercept' is also called the (estimated) baseline log-odds. It is the log-odds corresponding to the 'baseline' predictor level against which other effects are measured in the model.

In this model, the baseline predictor level is treatment=placebo and the baseline log-odds is −2.587.

If we want to convert this back into the estimated probability/frequency of recovery, we need the logistic function:

P(\text{recovery}|\text{placebo}) = \text{logistic}(-2.587) = \frac{e^{-2.587}}{1+e^{-2.587}}

If you want to calculate this, just define the logistic function in R:

logistic <- function(x) {
    exp(x) / (1 + exp(x))
}

logistic(-2.587)

If you've got this right, it should be pretty close to 7%.
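In fact the intercept can be recovered directly from the placebo counts, because the inverse of the logistic function is the log-odds ('logit') function; a quick check:

```r
# logit() is the inverse of logistic(): it maps a probability to its log-odds
logit <- function(p) log( p / (1 - p) )
logit( 7 / 100 )  # placebo row: 7 recovered out of 100
# ≈ -2.587, matching the fitted intercept
```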

What about the treatment effect?

  • The 'treatment_A' effect is 0.715. This means that the estimated additional log-odds of recovery conferred by the treatment is 0.715.
Terminology: the log odds ratio

The number 0.715 here is also called the (estimated) log odds ratio.

The odds ratio is e^{0.715}.

This terminology arises because a difference on the log-odds scale corresponds to a ratio on the odds scale. (I've explained the rationale for this naming further in the appendix below - but let's carry on for now.)

If we want to figure out the probability of recovery for a treated individual, we have to add up the baseline log-odds and the additional log-odds due to treatment (the log odds ratio) - and then transform back to the probability scale, to get

logistic( -2.587 + 0.715 )
Question

How do these numbers correspond to the estimates you would have made from the original data table?

These results highlight the first big advantage of the regression approach: it estimates a parameter that directly measures the effect of the thing of interest (i.e. how much 'better' the treatment is compared to the placebo).
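One way to check this correspondence for yourself is to compare the raw proportions in the original table to the fitted probabilities implied by the estimates quoted above (a sketch, using the rounded estimates):

```r
# Raw recovery proportions from the original table:
7 / (93 + 7)     # placebo:     0.07
12 / (78 + 12)   # treatment A: ≈ 0.133

# These match the model's fitted probabilities:
logistic <- function(x) exp(x) / (1 + exp(x))
logistic( -2.587 )           # ≈ 0.070
logistic( -2.587 + 0.715 )   # ≈ 0.133
```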

An aside on variable coding

Warning

How did R know we cared about treatment A versus placebo, rather than the other way round?

The answer turns out to be that 'placebo' comes before 'treatment A' in alphabetical order, so R decided to use 'placebo' as the baseline.

In general you should be careful to specify the order of variable levels yourself. One way to do this in R is to change the variables into 'factors' (which are categorical variables that have an underlying integer ordering). Something like this:

drug_trial_data$treatment = factor(
    drug_trial_data$treatment,
    levels = c( "placebo", "treatment_A" )
)

This approach would work even if the control group was called something like 'untreated', which comes later in the alphabet. At this point you have told R that you definitely want 'placebo' to be considered the first level.

(Another way is to re-code the variable into a 0/1 variable, as we did above for the outcome.)
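A related shortcut (an alternative not used in the text above) is relevel(), which simply moves a chosen level to the front of an existing factor; shown here on a toy vector:

```r
# relevel() makes the given level the first (baseline) level of a factor:
treatment = factor( c( "untreated", "treatment_A", "untreated" ) )
levels( treatment )                                  # alphabetical: "treatment_A" first
treatment = relevel( treatment, ref = "untreated" )
levels( treatment )                                  # now "untreated" is the baseline
```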

We can ignore this for now, but you should be aware of variable level ordering when doing this kind of work.

Interpreting the P-values

Each parameter in the output also gets a P-value (in the fourth column).

The P-values reflect the question: "How unlikely was such a strong estimated effect, if the true parameter were zero?"

In this analysis, the effect of the treatment is a single parameter with zero meaning 'no effect compared to the placebo', so that's the P-value that's of interest to us here.

Note

Compare these P-values to the simple binomial analysis above. Is the result different? Why?

Interpreting the standard errors

The whole point of statistics is to be able to work with uncertainty in a quantitative way. If you look at the table you can see that this analysis has represented uncertainty by giving us a standard error for each parameter estimate. It says that:

  • the baseline log-odds is estimated as -2.587 with a standard error of 0.392.
  • the treatment A effect is estimated as 0.715 with a standard error of 0.500.

If we call the baseline and treatment effects μ = −2.587 and β = 0.715 respectively, the model is:

P(\text{outcome}=\text{recovered}) = \text{logistic}\left( \mu + \beta \times \text{treatment} \right)

There are a couple of ways to interpret the standard errors. My favourite (the 'Bayesian approach') is to regard these as reflecting the approximate posterior distribution of the parameters. In other words, it says that having seen the data, our belief about the effect parameter should approximately correspond to a normal distribution with mean 0.715 and standard deviation 0.5.

The other interpretation (the 'frequentist approach') is to interpret these standard errors as the estimated amount of variation that we would see if we repeated the experiment. In other words, the estimate of the parameter from this dataset is 0.715, but if we repeated the experiment we would typically expect to see estimates up to a few standard errors away from this one.
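Under either interpretation, the standard error gives a quick approximate 95% interval as estimate ± 1.96 standard errors; a sketch using the numbers above:

```r
# Approximate 95% interval for the treatment effect, on the log-odds scale:
beta = 0.715   # estimate
se   = 0.500   # standard error
c( lower = beta - 1.96 * se, upper = beta + 1.96 * se )
# roughly (-0.265, 1.695): the interval includes zero, consistent with
# the unimpressive P-value for the treatment effect
```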

Logistic regression with covariates

The second big advantage of the regression approach is that it lets us deal with possible individual- or sample-level variation that isn't reflected in the overall summary table. For example, if you look back at the data now you'll see there are a couple of other variables that we could add in. Let's see if glm() can handle them:

g = glm(
    outcome_01 ~ treatment + age + sex, # <-- We've added age and sex to the model
    data = drug_trial_data,
    family = "binomial"
)
summary(g)$coeff

So we have jointly estimated the associations of the treatment, age, and sex with the outcome.

Questions

What is the estimated change in log-odds of recovery corresponding to an increase of one year's age in this analysis? What is the estimated change in log-odds of recovery corresponding to being female?

Appendix: Interpreting model formulae

If you look back at the glm command above we used a 'model formula' like this:

outcome ~ treatment + age
Warning

Confusing point! This does not mean that outcome is equal to the sum of treatment and age!

Instead this regression formula has a special interpretation. What it means, in fact, is that:

  • The outcome probability is modelled on the log-odds scale (glm() knows to do this because we set family="binomial")
  • A baseline log-odds is added (even though we didn't ask for one)
  • ...and each of the predictors is also given a parameter that the model will also estimate.

The upshot in this case is that the model being fit is of this form:

\text{logodds}(\text{outcome}=\text{recovered}) = \mu + \beta_1 \times \text{treatment} + \beta_2 \times \text{age}

Then glm() estimates the baseline log-odds μ\mu, and the effect parameters β1\beta_1 and β2\beta_2 as above.

Terminology: the linear predictor

The quantity on the right-hand side of the above

\mu + \beta_1 \times \text{treatment} + \beta_2 \times \text{age}

is called the linear predictor. (It's 'linear' because as you can see, it's the equation of a straight line in terms of the predictor variables treatment and age.)
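You can inspect both the linear predictor and the fitted probabilities of a model with predict(); a self-contained sketch on toy data (not the trial data):

```r
# Toy data, just to illustrate predict()'s two scales:
toy = data.frame(
    y = c( 0, 0, 1, 1, 0, 1 ),
    x = c( 1, 2, 3, 4, 5, 6 )
)
fit = glm( y ~ x, family = "binomial", data = toy )
eta = predict( fit, type = "link" )      # the linear predictor: mu + beta * x
p   = predict( fit, type = "response" )  # logistic( eta ), i.e. probabilities
# p and exp( eta ) / ( 1 + exp( eta ) ) agree
```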

Appendix: the log-odds ratio

In the results above:

> summary(g)$coefficients
                       Estimate Std. Error   z value     Pr(>|z|)
(Intercept)          -2.5866893  0.3919302 -6.599873 4.115106e-11
treatmenttreatment_A  0.7148872  0.4997631  1.430452 1.525873e-01

The number 0.715 is the (estimated) additional log-odds conferred by treatment A (compared to the placebo).

However it is also often referred to as the "log odds ratio". To see why, you have to think on the odds scale, which sits between the probability and log-odds scales in this transformation:

\text{Probability} \stackrel{\text{odds} = \frac{p}{1-p}}{\longrightarrow} \text{odds} \stackrel{\log(\text{odds})}{\longrightarrow} \text{log-odds}

We have

\text{logodds}(\text{recovery}|\text{placebo}) \approx -2.587

\text{logodds}(\text{recovery}|\text{treatment A}) \approx -2.587 + 0.715

If we now convert these back to odds scale by exponentiating, we get:

\text{odds}(\text{recovery}|\text{placebo}) \approx e^{-2.587} \quad\text{and}\quad \text{odds}(\text{recovery}|\text{treatment A}) \approx e^{-2.587 + 0.715}

And if we now take the ratio of these:

\frac{ \text{odds}(\text{recovery}|\text{treatment A}) }{ \text{odds}(\text{recovery}|\text{placebo}) } \approx \frac{e^{-2.587 + 0.715}}{e^{-2.587}} = e^{0.715}

This means that the odds ratio is e^{0.715} and the log odds ratio is 0.715.

The point is that addition on the log-odds scale corresponds to multiplication on the odds scale (because one is just the exponential of the other.)
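To put a number on this, you can exponentiate the log odds ratio quoted above (and, as a hypothetical convenience with a fitted model g, exp( coef(g) ) converts every estimate at once):

```r
# The estimated odds ratio for treatment A, from the log odds ratio above:
exp( 0.715 )
# ≈ 2.04, i.e. the treatment roughly doubles the odds of recovery
```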