Skip to main content

Another example: gene expression

Linear regression can also be used to assess association when the predictor variable is discrete rather than continuous.

Let's use it to estimate association between a particular SNP (rs1419114) and expression of the gene it is in - ATP2B4.

Note

If you want to see this SNP in its context, look it up in the UCSC Genome Browser.

The data can be found here. To load into R, right-click and choose 'copy link' and then load in R using the read_tsv() function as follows:

library(  tidyverse )
atp2b4_data = read_tsv( "<paste your link here>" )

What's in the data

Look at this data now using print(), View(), or other tools.

It has a sample identifier, a column reflecting the tissue the sample was drawn from, the genotype for rs1419114, and the measured expression of the gene.

Note

We have encoded the genotype as 0, 1, or 2 depending on how many copies of the non-reference allele the sample carriers.

Plotting the data and estimating association

Let's plot the data now.

p = (
ggplot( data = atp2b4_data )
+ geom_point( aes( x = genotype, y = atp2b4_expression ))
+ xlab( "rs1419114 genotype" )
+ ylab( "ATP2B4 expression")
)
print(p)

Does it look to you like the genotype and expression level are associated?

Fitting a regression model

Let's estimate association now:

fit = lm( atp2b4_expression ~ genotype, data = atp2b4_data )
coeffs = coefficients( summary( fit ))
Challenge

Add a line to the plot to show the estimated relationship. Can yopu interpret the coefficients?

To make sense of this, you have to understand something about the model that is being fit. In math syntax it can be written as follows:

expression level=μ+β×genotype+gaussian noise\text{expression level} = \mu + \beta \times \text{genotype} + \text{\textit{gaussian noise}}

Here μ\mu and β\beta are the two parameters (output in the two rows above) - μ\mu is the intercept, which captures the average expression values; and β\beta is the regression effect size parameter.

Note

β\beta should therefore be interpreted as: the estimated increase in expression associated with each additional copy of the 'G' allele. It is often called the association effect size.

All lm does is find the values of μ\mu and β\beta which best fit the data.

Interpreting the regression

If you print the coefficients they should look something like this:

              Estimate Std. Error  t value  Pr(>|t|)
(Intercept) 0.04981156 0.02097038 2.375329 0.0266714
dosage 0.02090930 0.01245827 1.678347 0.1074296

So what does this output say? It says that:

  • the estimated mean expression level is 0.04980.0498.
  • the standard error of this estimate is 0.02100.0210.
  • the estimated association effect size is 0.02090.0209.
  • the standard error of this estimate is 0.01250.0125.

A regression model with multiple predictors

One of the great advantages of regression over other methods is that it lets us 'control' for other variables that might be related to our outcome or predictor.

We'll come back to this later but for now let's try re-fitting the regression conditional on the other variable - the tissue from which the sample was derived (fetal liver or bone marrow):

fit2 = lm( atp2b4_expression ~ genotype + tissue, data = atp2b4_data )
summary( fit2$coefficients )
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 10.993531 3.732931 2.945013 0.0077329478
genotype 8.615789 2.458455 3.504555 0.0021095614
tissueliver -11.018946 2.809053 -3.922656 0.0007815145

You should now see three estimates - the baseline, the slope corresponding to the genotype, and another parameter reflecting an average difference between the bone marrow and fetal lever sample.

Question

What do you conclude about the association now - is the genotype associated to the expression level?