In this section, we will explore setting up the analysis model and interpreting the results, using RNA-Seq data from the Cancer Genome Atlas project (https://cancergenome.nih.gov).
Tutorial files can be found at: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2021
Data files and other resources are at: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials
We’ll assume all steps from part 1 have been run in your current R session. The edgeR package should also be available in your session but if it is not, or you are starting a new session, load the edgeR library:
library(edgeR)
In this section, we will look at R’s methods for defining linear models and determining what the coefficients represent.
Let’s start with just 2 groups and look in detail at setting up the analysis model:
conds.ku <- factor(conds[conds %in% c("KIRC", "UCEC")])
conds.ku
## exclude the BRCA samples from our data and sample info objects
## NB new DGEList objects are created in the code below so that we can revert to the full dataset with 3 groups later on.
brca <- which(s.info$Type=="BRCA")
y1 <- DGEList(counts=data[,-brca], genes=row.names(data))
## filter this smaller dataset
keep <- rowSums(cpm(y1)> 0.2) >= 9
y1 <- y1[keep,]
dim(y1)
y1 <- calcNormFactors(y1)
## create the design matrix, which describes the samples and defines the model
design1 <- model.matrix(~conds.ku)
colnames(design1) <- gsub("conds.ku", "", colnames(design1)) # removing 'conds.ku' string from the column names
head(design1)
This produces a design matrix with 2 columns - the first column consists of 1s and the second column contains a 1 for each UCEC sample and 0 otherwise. This defines a model with 2 coefficients, the first estimating mean expression in KIRC samples and the second estimating the difference between UCEC and KIRC. This is useful, as one of the coefficients directly represents the difference between the 2 groups, which is our comparison of interest.
Results can be found with a few more lines of code (more details on this later)
## estimate dispersions
y1 <- estimateGLMCommonDisp(y1, design1, verbose=T)
y1 <- estimateGLMTrendedDisp(y1, design1)
y1 <- estimateGLMTagwiseDisp(y1, design1)
## fit the model and statistical testing
fit1 <- glmFit(y1, design1)
res1 <- glmLRT(fit1)
## inspect results ordered on p-value column
head(res1$table[order(res1$table$PValue),])
Notice that R orders the levels of a factor alphabetically, so that when we create the conds.ku factor above its levels were listed as ‘Levels = KIRC UCEC’ You can see this by inspecting the conds.ku object:
conds.ku
In the cancer dataset, there isn’t a natural base or reference group but when the experiment includes an obvious control group (wildtype, untreated etc), it is helpful to ensure that this is the first group listed in the conditions factor. For example we might have groups named “Wildtype” and “Mutant”, which R by default would order as “Mutant”, “Wildtype” in the factor levels. We would need to change the order to generate a design matrix corresponding to the comparison Mutant vs Wildtype.
The order of the levels can be reversed when creating the factor, or afterwards by calling the factor function again and explicitly setting the levels. Going back to our cancer data, we can do this as follows:
conds.uk <- factor(conds.ku, levels=c("UCEC", "KIRC"))
## again create a new data object, y2, for this section
y2 <- y1
design2 <- model.matrix(~conds.uk)
colnames(design2) <- gsub("conds.uk", "", colnames(design2))
head(design2)
Notice that this time the KIRC samples are indicated by 1s in the second column, whereas before it was the UCEC samples. In this case, the first coefficient estimates the mean in UCEC samples and the second coefficient estimates the difference between KIRC and UCEC samples.
How might we expect the results to look compared to the model defined by design1?
Run the following steps again to repeat the analysis with the ‘reversed’ model
y2 <- estimateGLMCommonDisp(y2, design2, verbose=T)
y2 <- estimateGLMTrendedDisp(y2, design2)
y2 <- estimateGLMTagwiseDisp(y2, design2)
fit2 <- glmFit(y2, design2)
res2 <- glmLRT(fit2)
head(res2$table[order(res2$table$PValue),])
Everything in the output stays the same except for the sign of the logFC column. In the first case, the comparison UCEC vs KIRC was defined and in the second case, the comparison KIRC vs UCEC was defined. We could check the raw counts to confirm this. It is important to be clear when writing your analysis script and communicating the results which group has been compared to which - the output does not indicate this and as shown above it will depend on the naming of your groups and any change you may make to the default alphabetical ordering.
The above setup works well when there are just 2 groups, but with more groups or more complex experimental designs it is more intuitive to set up the model with each column corresponding to one group. In this case, each coefficient will estimate mean expression in a given group and the model will not directly estimate differences between groups. Then, any pair of groups can be compared by extracting the difference between estimated coefficients as a contrast (this requires an additional piece of code defining a contrast matrix).
y3 <- y1
## In this case, we can define the model as:
design3 <- model.matrix(~0+conds.uk)
colnames(design3) <- gsub("conds.uk", "", colnames(design3)) # removing 'conds.uk' string from the column names
design3
## Now the first column has 1s for UCEC samples and 0 otherwise, while column 2 has 1s for KIRC samples and 0 otherwise.
y3 <- estimateGLMCommonDisp(y3, design3, verbose=T)
y3 <- estimateGLMTrendedDisp(y3, design3)
y3 <- estimateGLMTagwiseDisp(y3, design3)
cont.matrix <- makeContrasts(KIRC-UCEC, levels=design3)
cont.matrix
fit3 <- glmFit(y3, design3)
res3 <- glmLRT(fit3, contrast=cont.matrix[,1])
head(res3$table[order(res3$table$PValue),])
Verify that the output matches the result from the same KIRC vs UCEC comparison generated by design2