All tutorials can be found at the course link: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2021
Data files and other resources are at: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials
Now we will go back to the full dataset with 3 cancer types and set up the model to find differentially expressed genes between them, using the original DGEList object (y) that was created in Part 1.
First we need to create a design matrix based on the experimental design, which as we saw in Part 2 can be set up in several equivalent ways. Here we will define the design matrix such that each column corresponds to one of the cancer types. Each sample has a row in the design matrix, and the columns indicate which samples belong to each experimental group: 1 when the sample belongs to that group, 0 otherwise.
This means our model will have 3 coefficients, each estimating the mean expression in one of the groups. We can then use contrasts to take the difference between pairs of coefficients and so find differentially expressed genes between cancer types. This is usually the safest and most intuitive way to define the model, especially for more complex experimental designs. A more detailed tutorial on this topic is available here, adapted from relevant sections of Chapter 9 of the limma User's Guide. There are also more details and examples in the edgeR User's Guide.
design <- model.matrix(~0+conds) # conds is the factor describing our experimental groups that we created earlier
head(design)
colnames(design) <- gsub("conds", "", colnames(design)) # removing 'conds' string from the column names
head(design)
Can you see why the columns have been named in the order they appear in the design matrix? Hint: inspect the conds object again.
Recall from earlier that our factor conds describing the status of each sample was obtained from the information read in from the sample information file and stored in s.info. The design matrix that we have just created uses conds to define the cancer type of each sample in the model. Thus, the first 3 samples are all breast cancer, the 4th is uterine, the 5th is breast, etc. We can double-check that this is consistent with our sample details.
head(s.info)
This is also why the check for the order of samples in the data object matching that of s.info was so important: the rows of the design matrix are assumed by R to correspond to the order of columns in the data object. Double-checking everything is correctly ordered/defined at this point is important to be sure we run the analysis with samples correctly assigned to their respective groups!
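A quick sanity check along these lines might look as follows. This sketch assumes, as set up earlier in the course, that the sample names are stored in the column names of the data object y and as the row names of s.info; adapt the names if your objects differ.

```r
# should return TRUE if the samples appear in the same order in both objects;
# assumes sample names are the column names of y and the row names of s.info
all(colnames(y) == rownames(s.info))
```

If this returns FALSE, reorder one of the objects before building the design matrix, otherwise samples will be assigned to the wrong groups.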
The variability of a gene across samples is a crucial factor in determining the significance of a change in gene expression. edgeR computes a common dispersion estimate for all genes, a trended estimate by mean expression, and individual dispersion estimates for each gene. By default, edgeR will make use of the gene-specific estimates, which are now thought to be the most appropriate choice for assessing differential expression. They are also adjusted to make them more robust to the typically small sample size in gene expression studies. With small samples, the estimates of mean and variance are generally poorer and can, for example, lead to false positive results if the variance is under-estimated. Built into limma and edgeR is a procedure to ‘borrow information between genes’ and improve the robustness of the variance estimates by ‘squeezing’ them towards a common value, according to how other genes with similar expression levels behave.
y <- estimateGLMCommonDisp(y, design, verbose=T)
## Disp = 0.68458 , BCV = 0.8274
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
plotBCV(y) # plot the biological coefficient of variation (BCV) against mean expression
This step fits the negative binomial (NB) GLM to each gene separately. Even for thousands of genes, it completes almost immediately.
fit <- glmFit(y, design)
We can now define a contrast matrix to extract the comparisons of interest - this time all pairwise comparisons between the 3 groups. Note we can refer to the groups by the column names of the design matrix.
cont.matrix <- makeContrasts(BRCA-KIRC, BRCA-UCEC, KIRC-UCEC, levels=design)
cont.matrix
The first contrast is BRCA vs KIRC, indicated by the 1 and -1 in the first column of the contrast matrix. This means that genes more highly expressed in the first group, BRCA, will have positive logFCs, and genes with lower expression in BRCA compared to KIRC samples will have negative logFCs. Genes similarly expressed in both groups will result in logFCs close to zero. In this dataset, there is no natural 'reference' group, but often it makes sense to have the control group as the second group, so that changes are reported relative to the control condition. It's important to communicate the order of comparison when sharing results of a differential expression analysis.
res.brca.kirc <- glmLRT(fit, contrast=cont.matrix[,1])
class(res.brca.kirc)
This step completes the testing for differential expression and stores the results in a new object we have named res.brca.kirc. This object contains a large number of components relating to the statistical testing. The main results table can be accessed with:
head(res.brca.kirc$table)
This is a similar format to the topTable output of limma, and contains the results for all analysed genes. Notice that they are in alphabetical order at this point and also that only raw p-values are given. The testing procedure generates a likelihood ratio (LR) as a test statistic and corresponding p-values. The LR comes from comparing the fit of the full model (with our explanatory variable, the different groups) to a reduced model not including this explanatory variable - when a gene is differentially expressed between groups, the fit will be better under the full model, and if it is substantially better will generate a significant p-value.
We can summarise the genes called significantly differentially expressed.
de <- decideTestsDGE(res.brca.kirc)
## by default summarises results at 5% FDR and tells us how many significantly up- and down-regulated genes there are
summary(de)
We can also visualise the results for each gene by plotting mean expression level vs logFC.
detags <- rownames(y)[as.logical(de)]
plotSmear(res.brca.kirc, de.tags=detags, main="BRCA vs KIRC", ylim=c(-10,10))
abline(h=c(-1,1), col="blue")
To rank the genes according to evidence for differential expression between the BRCA and KIRC samples we can run the following commands.
o <- order(res.brca.kirc$table$PValue)
res.brca.kirc <- res.brca.kirc$table[o,]
It is common practice to adjust the raw p-values for multiple testing (of thousands of genes) with a false discovery rate correction. A more detailed explanation of multiple testing correction can be found on p5 of this tutorial.
## compute and add adjusted p-values
adjp <- p.adjust(res.brca.kirc$PValue, method="fdr")
res.brca.kirc <- cbind(res.brca.kirc, adjp)
head(res.brca.kirc)
write.table(res.brca.kirc, "Cancer_dataset_BRCA_vs_KIRC_edgeR_results.txt", sep="\t", quote=F, row.names=T)
This saves the entire output to file - it is always worth writing out the full table of genes as it can be filtered to significant genes later, but if you want to check the result for a particular gene (maybe it just missed the significance threshold) and had only saved the significant genes, you would need to run all the analysis code again.
Notice that the file naming convention is consistent with the way our contrasts were defined earlier and helps interpret the logFCs correctly - they could be cross-checked against raw data, but otherwise there would be no way of knowing which group was used as the reference. Again by convention, the logFCs are on the log2 scale, with 0 indicating no difference between groups, positive logFCs indicating higher expression in the first group, and negative logFCs indicating higher expression in the second group.
length(which(res.brca.kirc$adjp<0.05)) # 7434
Repeat the last section to extract differential expression results for:
Breast vs uterine cancer samples
Kidney vs uterine cancer samples
Store each set of results in a suitably-named object and note how many significant genes there are in each list. Use the intersect function to determine the overlap between each pair of lists and maybe try creating a Venn diagram.
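As a sketch of the overlap step, assuming the two additional sets of results have been stored in objects named res.brca.ucec and res.kirc.ucec (hypothetical names - use whatever you chose), each with an adjp column added as above:

```r
# significant gene lists at 5% FDR for each comparison
sig.bk <- rownames(res.brca.kirc)[res.brca.kirc$adjp < 0.05]
sig.bu <- rownames(res.brca.ucec)[res.brca.ucec$adjp < 0.05]
sig.ku <- rownames(res.kirc.ucec)[res.kirc.ucec$adjp < 0.05]

# overlap between a pair of lists
length(intersect(sig.bk, sig.bu))

# a simple three-way Venn diagram using limma's vennCounts/vennDiagram,
# built from a logical membership matrix over all genes tested
all.genes <- rownames(res.brca.kirc)
venn.mat <- cbind(BRCAvsKIRC = all.genes %in% sig.bk,
                  BRCAvsUCEC = all.genes %in% sig.bu,
                  KIRCvsUCEC = all.genes %in% sig.ku)
vennDiagram(vennCounts(venn.mat))
```

The same membership matrix can also be passed to other Venn-plotting packages if you prefer their output.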
There are various ways that the results can be visualised and further explored for biological insights.
For example, we can make boxplots for the top 10 differentially expressed genes for the breast vs kidney comparison and save them to a PDF. The output file can be opened from your working directory once this segment of code has been run.
cpm <- as.matrix(cpm(y))
pdf("Cancer_dataset_boxplots_top10_genes_BRCAvsKIRC.pdf", onefile=T)
# use a loop to iterate over 10 genes
for(i in 1:10)
{
x <- match(row.names(res.brca.kirc)[i], row.names(cpm))
boxplot(cpm[x,]~conds, las=2, cex.axis=0.8, main=paste0(row.names(res.brca.kirc)[i]), ylab="Counts per million")
}
dev.off()
We can also make a heatmap for say the top 100 differentially expressed genes or explore the over-represented biological pathways that might give some clues for interpreting the results.
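A minimal heatmap along these lines could be sketched as follows, assuming res.brca.kirc is ordered by p-value as above. Base R's heatmap is used here for simplicity; row scaling highlights relative expression patterns across samples.

```r
# top 100 genes by evidence for differential expression, BRCA vs KIRC
top100 <- rownames(res.brca.kirc)[1:100]
logcpm <- cpm(y, log=TRUE)  # log2 CPM values are better suited to plotting than raw CPM
heatmap(logcpm[top100, ], scale="row",
        labCol=as.character(conds), cexRow=0.4,
        main="Top 100 DE genes, BRCA vs KIRC")
```

Packages such as pheatmap or ComplexHeatmap offer more control over annotation and clustering if you want a publication-quality figure.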
edgeR offers some ways to explore the functional roles of genes found differentially expressed, as do many other R/Bioconductor packages such as goseq, or tools such as GSEA from the Broad Institute, which is also implemented in R. The biomaRt package is very useful for mapping between and retrieving different gene annotations and identifiers.
The original limma package for microarray data was subsequently extended to include methods to deal with RNA-Seq data by transforming the count data prior to fitting standard linear models (the alternative to using models specifically for count data). There was some debate over which approach would be best, and in the end they did both! This is the limma-voom approach and is described in Chapter 15 of the limma User's Guide. If you are really keen, you can run the same analysis with limma-voom and see how the results compare.
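For reference, a limma-voom version of this analysis, reusing the y, design and cont.matrix objects created above, would look roughly like this sketch:

```r
v <- voom(y, design, plot=TRUE)       # transform counts and estimate precision weights
vfit <- lmFit(v, design)              # fit a linear model to each gene
vfit <- contrasts.fit(vfit, cont.matrix)
vfit <- eBayes(vfit)                  # empirical Bayes moderation of the variances
topTable(vfit, coef=1)                # top genes for the first contrast (BRCA vs KIRC)
```

The output of topTable has the same layout you met earlier in the course, so the two sets of results can be compared gene by gene.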
There are also a host of other packages and methods for analysing RNA-Seq data - we have focused on edgeR as it is the one we primarily use, largely based on the strong reputation and track record of its developers.
More resources can be found here: