Another example: gene expression
Linear regression can also be used to assess association when the predictor variable is discrete rather than continuous.
Let's use it to estimate association between a particular SNP (rs1419114) and expression of the gene it is in - ATP2B4.
If you want to see this SNP in its context, look it up in the UCSC Genome Browser.
The data can be found
here. To load
into R, right-click and choose 'copy link' and then load in R using the read_tsv()
function as follows:
library( tidyverse )
atp2b4_data = read_tsv( "<paste your link here>" )
What's in the data
Look at this data now using print()
, View()
, or other tools.
It has a sample identifier, a column reflecting the tissue the sample was drawn from, the genotype for rs1419114, and the measured expression of the gene.
We have encoded the genotype as 0, 1, or 2 depending on how many copies of the non-reference allele the sample carriers.
Plotting the data and estimating association
Let's plot the data now.
p = (
ggplot( data = atp2b4_data )
+ geom_point( aes( x = genotype, y = atp2b4_expression ))
+ xlab( "rs1419114 genotype" )
+ ylab( "ATP2B4 expression")
)
print(p)
Does it look to you like the genotype and expression level are associated?
Fitting a regression model
Let's estimate association now:
fit = lm( atp2b4_expression ~ genotype, data = atp2b4_data )
coeffs = coefficients( summary( fit ))
Add a line to the plot to show the estimated relationship. Can yopu interpret the coefficients?
To make sense of this, you have to understand something about the model that is being fit. In math syntax it can be written as follows:
Here and are the two parameters (output in the two rows above) - is the intercept, which captures the average expression values; and is the regression effect size parameter.
should therefore be interpreted as: the estimated increase in expression associated with each additional copy of the 'G' allele. It is often called the association effect size.
All lm
does is find the values of and which best fit the data.
Interpreting the regression
If you print the coefficients they should look something like this:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.04981156 0.02097038 2.375329 0.0266714
dosage 0.02090930 0.01245827 1.678347 0.1074296
So what does this output say? It says that:
- the estimated mean expression level is .
- the standard error of this estimate is .
- the estimated association effect size is .
- the standard error of this estimate is .
A regression model with multiple predictors
One of the great advantages of regression over other methods is that it lets us 'control' for other variables that might be related to our outcome or predictor.
We'll come back to this later but for now let's try re-fitting the regression conditional on the other variable - the tissue from which the sample was derived (fetal liver or bone marrow):
fit2 = lm( atp2b4_expression ~ genotype + tissue, data = atp2b4_data )
summary( fit2$coefficients )
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.993531 3.732931 2.945013 0.0077329478
genotype 8.615789 2.458455 3.504555 0.0021095614
tissueliver -11.018946 2.809053 -3.922656 0.0007815145
You should now see three estimates - the baseline, the slope corresponding to the genotype, and another parameter reflecting an average difference between the bone marrow and fetal lever sample.
What do you conclude about the association now - is the genotype associated to the expression level?