---
title: "RNA-Seq Workshop - Part 3"
date: 14 December 2021
author: Helen Lockstone and Ben Wright, Bioinformatics Core, WHG
output: html_document
---


All tutorials can be found at the course link: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Dec2021

Data files and other resources are at: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials


```{r include = FALSE}
knitr::opts_chunk$set(eval=FALSE)
```

## Differential Expression Analysis

Now we will go back to the full dataset with 3 cancer types and set up the model to find differentially expressed genes between them. This will be the original DGEList object (y) that was created in Part 1. 


First we need to create a design matrix based on the experimental design, which as we saw in Part 2 can be set up in several equivalent ways. Here we will define the design matrix such that each column corresponds to one of the cancer types. Each sample has a row in the design matrix, and the columns indicate which samples belong to each experimental group: 1 when the sample belongs to that group, 0 otherwise. 

This means our model will have 3 coefficients, each estimating the mean expression in one of the groups. We can then use contrasts to find the difference between pairs of coefficients to find differentially expressed genes between cancer types. This is usually the safest and most intuitive way to define the model, especially for more complex experimental designs. A more detailed tutorial on this topic is available [here](https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day2_271118/Part3_limma_analysis_models.pdf), adapted from relevant sections of the limma Users Guide [Chapter 9](https://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf). There are also more details and examples in the [edgeR User Guide](http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf).


```{r}
design <- model.matrix(~0+conds) # conds is the factor describing our experimental groups that we created earlier
head(design)
colnames(design) <- gsub("conds", "", colnames(design)) # removing 'conds' string from the column names
head(design)
```
Can you see why the columns have been named in the order they appear in the design matrix? *Hint, inspect the **conds** object again.*

Recall from earlier that our factor **conds** describing the status of each sample was obtained from the information read in from the sample information file and stored in **s.info**. The design matrix that we have just created uses **conds** to define the cancer type of each sample in the model. Thus, the first 3 samples are all breast cancer, the 4th is uterine, the 5th is breast etc. We can double check that this is consistent with our sample details.

```{r}
head(s.info)
```
This is also why the check for the order of samples in the **data** object matching that of **s.info** was so important: the rows of the design matrix are assumed by R to correspond to the order of columns in the **data** object.
**Double-checking everything is correctly ordered/defined at this point is important to be sure we run the analysis with samples correctly assigned to their respective groups!**


### Estimating gene dispersions 

The variability of a gene across samples is a crucial factor in determining the significance of a change in gene expression. edgeR computes a common dispersion estimate for all genes, a trended estimate by mean expression, and individual dispersion estimates for each gene. By default, edgeR will make use of the gene-specific estimates, which are now thought to be the most appropriate choice for assessing differential expression. They are also adjusted to make them more robust to the typically small sample size in gene expression studies. With small samples, the estimates of mean and variance are generally poorer and can, for example, lead to false positive results if the variance is under-estimated. Built into limma and edgeR is a procedure to ‘borrow information between genes’ and improve the robustness of the variance estimates by ‘squeezing’ them towards a common value, according to how other genes with similar expression levels behave.


```{r, eval=TRUE}

y <- estimateGLMCommonDisp(y, design, verbose=T) 
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

plotBCV(y) # a plot of the mean-variance relationship
```


### Fitting the model

This step fits the negative binomial (NB) glm to each gene separately. Even for thousands of genes, it completes almost immediately. 

```{r}
fit <- glmFit(y, design)
```

We can now define a contrast matrix to extract the comparisons of interest - this time all pair-wise comparisons between the 3 groups. Note we can refer to the groups by the column names of the design matrix.

```{r}
cont.matrix <- makeContrasts(BRCA-KIRC, BRCA-UCEC, KIRC-UCEC, levels=design)
cont.matrix

```

The first contrast is BRCA vs KIRC, indicated by the 1 and -1 in the first column of the contrast matrix. This means that genes more highly expressed in the first group, BRCA, will have positive logFCs and genes with lower expression in BRCA compared to KIRC samples will have negative logFCs. Genes similarly expressed in both groups will result in logFCs close to zero. In this dataset, there is no natural 'reference' group, but often it makes sense to have control group as the second group, so changes are reported relative to control condition. It's important to communicate the order of comparison when sharing results of a differential expression analysis. 

### Extracting the BRCA vs KIRC comparison

```{r}
res.brca.kirc <- glmLRT(fit, contrast=cont.matrix[,1])

class(res.brca.kirc)
```

This step completes the testing for differential expression and stores the results in a new object we have named **res.brca.kirc**. This object contains a large number of components relating to the statistical testing. The main results table can be accessed with:

```{r}
head(res.brca.kirc$table)                 
```

This is a similar format to the topTable output of limma, and contains the results for all analysed genes. Notice that they are in alphabetical order at this point and also that only raw p-values are given. The testing procedure generates a likelihood ratio (LR) as a test statistic and corresponding p-values. The LR comes from comparing the fit of the full model (with our explanatory variable, the different groups) to a reduced model not including this explanatory variable - when a gene is differentially expressed between groups, the fit will be better under the full model, and if it is substantially better will generate a significant p-value. 


We can summarise the genes called significantly differentially expressed.

```{r}
de <- decideTestsDGE(res.brca.kirc)

## by default summarises results at 5% FDR and tells us how many significantly up- and down-regulated genes there are

summary(de <- decideTestsDGE(res.brca.kirc))
```

We can also visualise the results for each gene by plotting mean expression level vs logFC.
```{r}
detags <- rownames(y)[as.logical(de)]
plotSmear(res.brca.kirc, de.tags=detags, main="BRCA vs KIRC", ylim=c(-10,10))
abline(h=c(-1,1), col="blue")
```

To rank the genes according to evidence for differential expression between the BRCA and KIRC samples we can run the following commands.
```{r}
o <- order(res.brca.kirc$table$PValue)
res.brca.kirc <- res.brca.kirc$table[o,]
```

It is common practice to adjust the raw p-values for multiple testing (of thousands of genes) with a false discovery rate correction. A more detailed explanation of multiple testing correction can be found on [p5 of this tutorial](https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day2_271118/Part2_statistical_hypothesis_testing_tutorial_271118.pdf)
```{r}
## compute and add adjusted p-values
adjp <- p.adjust(res.brca.kirc$PValue, method="fdr")
res.brca.kirc <- cbind(res.brca.kirc, adjp)
head(res.brca.kirc)
write.table(res.brca.kirc, "Cancer_dataset_BRCA_vs_KIRC_edgeR_results.txt", sep="\t", quote=F, row.names=T)
```

*This saves the entire output to file - it is always worth writing out the full table of genes as it can be filtered to significant genes later, but if you want to check the result for a particular gene (maybe it just missed the significance threshold) and had only saved the significant genes, you would need to run all the analysis code again.*

Notice that the file naming convention is consistent with the way our contrasts were defined earlier and helps interpret the logFCs correctly - they could be cross-checked against raw data but otherwise there would be no way of knowing which group was used as the reference. Again by convention, the logFCs are log2 scale with 0 indicating no difference between groups, positive logFCs indicate higher expression in the first group, and negative logFcs indicate higher expression in the second group. 


```{r}
length(which(res.brca.kirc$adjp<0.05)) # 7434
```



## Exercise

Repeat the last section to extract differential expression results for:

1. Breast vs uterine cancer samples

2. Kidney vs uterine cancer samples

Store each set of results in a suitably-named object and note how many significant genes there are in each list. Use the **intersect** function to determine the overlap between each pair of lists and maybe try creating a Venn diagram.

***

## Data visualisation

There are various ways that the results can be visualised and further explored for biological insights. 

For example, we can make boxplots for the top 10 differentially expressed genes for the breast vs kidney comparison and save them to a PDF. The output file can be opened from your working directory once this segment of code has been run. 

```{r}
cpm <- as.matrix(cpm(y))
pdf("Cancer_dataset_boxplots_top10_genes_BRCAvsKIRC.pdf", onefile=T)
# use a loop to iterate over 10 genes
for(i in 1:10)
{
	x <- match(row.names(res.brca.kirc)[i], row.names(cpm))
	boxplot(cpm[x,]~conds, las=2, cex.axis=0.8, main=paste0(row.names(res.brca.kirc)[i]), ylab="Counts per million")
}
dev.off()
```

We can also make a heatmap for, say, the top 25 differentially expressed genes.

```{r}
# Now make a new matrix of counts for the top 25 genes
top.25.y <- y$counts[rownames(y$counts) %in% row.names(res.brca.kirc[1:25,]),]
# Only show BRCA and KIRC samples
top.25.y <- top.25.y[,s.info$Type %in% c("BRCA","KIRC")]
# Now make the heatmap
pdf("Cancer_dataset_heatmap_top25_genes_BRCAvsKIRC.pdf")
heatmap(top.25.y,Rowv=NA)
dev.off()
```

We could also explore the data in other ways such as experimenting with the thresholds for outliers or looking at over-represented biological pathways. 

## Pathway analysis
edgeR offers some ways to explore the functional roles of genes found differentially expressed, as do many other R/Bioconductor packages such as [Goseq](https://bioconductor.org/packages/release/bioc/html/goseq.html) or tools such as [GSEA](http://software.broadinstitute.org/gsea/index.jsp) from the Broad Institute, which is also implemented in R. The [biomaRt](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) package is very useful for mapping between and retrieving different gene annotations and identifiers. 

## Further reading
The original limma package for microarray data was subsequently extended to include methods to deal with RNA-Seq data by transforming the count data prior to fitting standard linear models (the alternative to using models specifically for count data). There was some dilemma over which approach would be best and in the end they did both! This is the limma-voom approach and is described in Chapter 15 of the limma Users Guide. If you are really keen, you can run the same analysis with limma-voom and see how the results compare. 

There are also a host of other packages and methods for analysing RNA-Seq data - we have focused on edgeR as it is the one we primarily use, largely based on the strong reputation and track record of its developers.

More resources can be found here:

https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials/resources/RNA-Seq_Analysis_further_resources.pdf

***