Association example: gene expression data
Linear regression can also be used to assess association when the predictor variable is discrete rather than continuous.
Let's use it to investigate a link between a particular genetic variant (rs1419114) and gene expression levels of the gene it is in - ATP2B4.
If you want to see this SNP in its genomic context, look it up in the UCSC Genome Browser. The SNP lies in an intron of ATP2B4, but has been associated with several red cell traits as well as with malaria susceptibility.
The data can be found
here. To load
into R, right-click and choose 'copy link' and then load in R using the read_tsv()
function as follows:
library( tidyverse )
atp2b4_data = read_tsv( "https://www.chg.ox.ac.uk/bioinformatics/training/msc_gm/2024/data/atp2b4_expression_example.tsv" )
What's in the data
Look at this data now using print()
, View()
, or other tools.
It has a sample identifier, a column reflecting the tissue the sample was drawn from, the genotype for rs1419114, and the measured expression of the gene.
We have encoded the genotype as 0, 1, or 2 depending on how many copies of the non-reference allele the sample carriers.
The reference and non-reference alleles for this variant are actually A
and G
, so the genotypes are:
Genotype | Encoding |
---|---|
AA | 0 |
AG | 1 |
GG | 2 |
Plotting the data and estimating association
Let's plot the data now.
p = (
ggplot( data = atp2b4_data )
+ geom_point( aes( x = genotype, y = atp2b4_expression ))
+ xlab( "rs1419114 genotype" )
+ ylab( "ATP2B4 expression")
)
print(p)
Does it look to you like the genotype and expression level are associated?
The 'ATP2B4 expression' levels here have been computed from RNA-seq data. In short, the number of RNA-seq reads aligning over the gene was computed, and then it was normalised by the total number of RNA-seq reads for the sample. (In most RNA-seq analyses, you'd also want to normalise by gene length to compute 'transcripts per million' (TPM), but we won't bother doing that here because we are focussed on just one gene.)
Fitting a regression model
Let's estimate association now:
fit = lm( atp2b4_expression ~ genotype, data = atp2b4_data )
coeffs = coefficients( summary( fit ))
Add a line to the plot to show the estimated relationship. Can you interpret the coefficients?
To make sense of this, you have to understand the model that is being fit. In math syntax it can be written as follows:
Here and are the two parameters (output in the two rows above) - is the intercept, which captures the modelled average expression values for genotype=0 individuals; and is the regression effect size parameter.
should therefore be interpreted as: the estimated increase in expression associated with each additional copy of the 'G' allele.
All lm
is doing for us is finding the values of and which best fit the data.
Interpreting the regression
If you print the coefficients they should look something like this:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.165760 4.785387 2.542273 0.01856574
genotype 4.395767 2.842943 1.546203 0.13632062
So what does this output say? It says that:
- the estimated mean expression level for genotype individuals (the regression intercept) is .
- the standard error of this estimate is .
- the estimated association effect size (the regression slope) is .
- the standard error of this estimate is .
In other words:
- in people with two copies of the reference allele, the average expression is
- each copy of the non-reference allele is associated with an increase of in the level of expression.
However the model is not very certain about these estimates. For example, using our rule
the slope could plausibly be as low as -1.18 or as high as 9.97!
Interpreting the P-values
The output also gives P-values. These answer the question
"how likely would be be to see such a big effect by chance?
or more accurately
"How likely would it be to see an estimate as large as the one we've seen, in a sample of this size, if the true parameter was actually zero?".
You can see that we'd be pretty unlikely to have seen an intercept that high 'just by chance' - for the intercept, so we might see this in only about 2 in a hundred experiments. But it's certainly possible that the positive slope could be a chance finding - because , so it might happen by chance in >10% of experiments like this.
A regression model with multiple predictors
One of the great advantages of regression over other methods is that it lets us 'control' for other variables that might be related to our outcome or predictor.
We'll come back to this later but for now let's try re-fitting the regression conditional on the other variable - the tissue from which the sample was derived (fetal liver or bone marrow):
fit2 = lm( atp2b4_expression ~ genotype + tissue, data = atp2b4_data )
summary( fit2$coefficients )
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.993531 3.732931 2.945013 0.0077329478
genotype 8.615789 2.458455 3.504555 0.0021095614
tissueliver -11.018946 2.809053 -3.922656 0.0007815145
You should now see three estimates - the baseline, the slope corresponding to the genotype, and another parameter reflecting an average difference between the bone marrow and fetal lever sample.
What do you conclude about the association now - is the genotype associated to the expression level?
Challenge questions
The regression above look for an additive effect of the genotype on the expression level. But is this really sensible? These challenge questions explore this"
Given all the feedback loops and complexities of gene transcription, it might be reasonable to think that genetic effects on transcription might be more like multiplicative effects, instead of additive ones.
To assess this, can you repeat the regressions above using the logarithm of the expression level as the outcome? I.e. something like this:
atp2b4_data$log_atp2b4_expression = log10( atp2b4_data$atp2b4_expression )
(The logarithm base is not very important here, so I suggest using the base 10 logarithm, as it's usually simplest to interpret.)
Make a histogram of the expression values before and after logarithm to see what this transformation has done.
Now re-fit the regression models - do you draw similar conclusions as above?
Note 1 The results will of course be numerically different to those above - after all we have transformed all the outcome values.
Note 2 Working on the logarithmic scale amounts to estimating multiplicative instead of additive effects of the genotype on the outcome.
This is true because of the basic raison d'etre of the logarithm function, namely that it carries multiplication on the natural, un-logged scale to addition on the log scale, like this:
So for example, if your estimate using the log-scale expression is , then you could interpret that as
"The estimated increase in log10-expression for each additional copy of the 'G' allele is 2."
or as
"The estimated expression is 100-fold higher for each additional copy of the 'G' allele."
(since ). Or even as
"The estimated fold change in expression for each additional copy of the 'G' allele is 100."
(Although your estimate will probably be much smaller than this!)
Maybe we shouldn't be encoding additive effects anyway - perhaps the genotype might affect the trait through a dominance relationship (i.e. only the GG
genotype would have an effect, compared to AA
and AG
) or a recessive relationship (the AG
and GG
genotypes would have the same effect, relative to AA
.)
Can you assess whether there's any evidence for dominance or recessive effects? (For example - you could do this by adding new variables with different genotype encodings to the data.)
(However, if this SNP is in a promoter region for the gene, there is however a biological reason why we might expect the effect to get larger with each copy of the allele. Can you think what this is?)