---
title: "RNA-Seq Workshop - Part 1"
date: 14 December 2021
author: Helen Lockstone and Ben Wright, Bioinformatics Core, WHG
output: html_document
---

Tutorial guides and markdown files can be found at the course link: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Dec2021

Data files and other resources are available here:
https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials

```{r include = FALSE}
knitr::opts_chunk$set(eval=FALSE)
```

## Data Preprocessing

In this practical,  we will look at some RNA-Seq data from the Cancer Genome Atlas project (https://cancergenome.nih.gov). We'll use a dataset of 50 samples including 3 different cancer types (breast, kidney and endometrial) and we are interested to identify differentially expressed genes. Before we can do that, a number of preprocessing steps need to be performed - this tutorial will guide you through the steps of reading in data to R, dealing with formatting issues, then normalising and filtering the data. 


### Setup

1. Create a new folder for today's workshop on your computer

2. Open a new R session by launching RStudio

3. Set the working directory in RStudio to match the location of the your new folder 

This can be done via the RStudio main menu toolbar at the top: Session > Set working directory > Choose working directory...

Navigate through your filesystem to the correct directory and click OK. A command **setwd("path_to_directory")** should automatically be executed in the R console - the bottom left panel in RStudio. You can also confirm the correct setting of the working directory with the **getwd** command:

```{r}
getwd()
```

R will by default look in the current working directory for files to read in, and the location to write any output files we generate. 

* The easiest way to run this tutorial yourself in R is to download the version of this practical with the .rmd extension (rmarkdown file) to your working directory. If you then open this file in RStudio (click on the open file icon in the top left text editor panel), this will allow the chunks of code to be run directly when you reach them - the commands will be automatically executed in the console panel, and any output including plots will appear in the text editor panel.
* You can alternatively run any commands shown in this document by copy/pasting the R commands into the topleft panel of RStudio and clicking the icon with a green forward arrow in the toolbar. The line of code where the cursor is will be automatically copied into the bottom left panel and executed by R in realtime. You should see the R prompt **>** again when it has finished. Depending on the command, some output may also be printed to the screen or plot window. You can save the file afterwards to have a record of what you have done and can easily re-run again in future. 

If you see an error message, first double check that the command is entered exactly as shown. If you can't find the problem, please ask for help.

To continue, download the data files that we will be using by running the following two commands in R. 

```{r eval=FALSE}
## file containing gene expression data
download.file("https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials/data/Cancer_gene_expression_data.txt", "./Cancer_gene_expression_data.txt")

## file with sample information
download.file("https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_materials/data/Cancer_sample_info.txt", "./Cancer_sample_info.txt")
```

The **"./name_of_file"** part tells R to save the file in the current working directory and the two files should now be visible there on your computer. 

The final preparation step is to install and load a Bioconductor package for performing RNA-Seq analysis. We will use edgeR and you can read more details about this package [here](https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf). See [here](https://bioconductor.org/) for an overview of Bioconductor itself.

If you don't already have edgeR installed, run the following segment of code:
```{r eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("edgeR")
```

To load the package into the current session, use the following command (you need to load a package each time you open a new R session, but the installation only needs to be done once)

```{r eval=FALSE}
library(edgeR)
```

***

## Reading data into R and preparing for analysis

Now, we can read in the gene expression data file and check its contents - there should be 23368 rows corresponding to genes, and 50 columns corresponding to samples.

```{r}
data <- read.table("Cancer_gene_expression_data.txt", sep="\t", header=T, row.names=1)
dim(data)

head(data[,1:6]) # shows the first 6 rows (default for head command) for the first 6 samples, which are selected using the 1:6 in square brackets
```

These are raw counts from an RNA-Seq experiment. Next read in the sample details from the second file. 

```{r}
s.info <- read.table("Cancer_sample_info.txt", sep="\t", header=T)
dim(s.info) 	
head(s.info)
```

*NB In the current version of R (v4.0.3 and later), strings (text columns) are read in as character vectors. In earlier versions of R the default was to store them as factors - a different type of R object. Often this was unhelpful as factors behave differently to vectors; an extra argument stringsAsFactors=F was required in the **read.table** command to over-ride this behaviour.*

At this point, it is critical to ensure that the samples are in the same order in the data object and the accompanying sample information as future steps will assign experimental conditions assuming this to be the case. 

To do this, we can compare the column names of our **data** object with the Sample_ID column of **s.info**. First, we need to deal with some automatic formatting R performed internally when reading in the data - if you look back at the output of the **head** commands, you will notice that the column names of **data** have dots replacing the original dash symbols in  *Cancer_gene_expression_data.txt*. 

*Indeed the original file still contains the dashes, it is only altered in the R environment to ensure the column names conform to R's internal naming rules. Note this only applies to column names, and thus the dashes are also retained in column 1 of the **s.info** object. Another rule regarding column names is that they must begin with a letter.*

To check if the two vectors match, we need to make the same dash to dot substitution in the sample IDs in the **s.info** object. 

```{r}
s.info$Sample_ID <- gsub("-", ".", s.info$Sample_ID)

## now check if the colnames of data and sample IDs in s.info match
identical(colnames(data), s.info$Sample_ID) 

```

Unfortunately it returns FALSE, indicating a discrepancy somewhere and we need to check what it might be. (If you are quite familiar with R, have a think about how this might be done).

From a quick inspection of the first few samples, they look to be in the same order:

```{r}
head(colnames(data))
head(s.info$Sample_ID)

## we can use the R function setdiff to help track down the issue:
setdiff(colnames(data), s.info$Sample_ID)
```

One of the original sample IDs, 3Z-A93Z, started with a number, which is not permitted for column names in R and so it has automatically been prefixed with an X to create a valid name. Therefore, we need to modify the sample table again for this specific sample:

```{r}
match("3Z.A93Z", s.info[,1])
```

This tells us it is element (row) 32 of the first column of **s.info** that needs modifying. This can be done by assigning it the sample name as it appears in the column names of **data**.
```{r}
s.info[32, 1] <- "X3Z.A93Z"

## finally re-running the previous command confirms they are now identical
identical(colnames(data), s.info$Sample_ID) 
```

The formatting issues seen here are typical of many datasets where inconsistent naming patterns or special characters can cause problems for data handling and analysis. It cannot be emphasised enough how important it is to check these details carefully before embarking on analysis steps, as it is very easy to make inadvertent mistakes that could render the resuslts meaningless. 

Now everything is correct, we can proceed with setting up the analysis. Let's see what type of samples we have and how many of each. 

```{r}
table(s.info$Type)
```

BRCA indicates breast cancer samples, KIRC for kidney/renal and UCEC uterine/endometrial. Note  that the total number of samples is 50, consistent with our R objects.  Now we can define a factor for our experimental groups (cancer types). 

```{r}
conds <- factor(s.info$Type)

## inspect the new factor object
conds
```

Returning to the count data, we might also want to check the sequencing depth (number of millions of reads) for each sample.

```{r}
read.depth <- apply(data, 2, sum)/1000000 
summary(read.depth)
barplot(read.depth, las=2, cex.names=0.5)
```

The median sequencing depth is ~47 million reads, but there is high variability between samples as illustrated in the barplot.

*NB the **apply** function is a very efficient way to perform a particular task across columns of an object (indicated by the 2 in the command above) or rows (1), in this case computing the sum of all reads in each sample.*

***

# Performing differential expression analysis with edgeR

The edgeR package was developed by Gordon K Smyth and colleagues at the Walter and Eliza Hall Institute in Melbourne around 10 years ago and has been extended and improved ever since. The same group previously developed the widely-used 'limma' framework for microarray data and many of the concepts were applied to RNA-Seq data. edgeR implements statistical models suitable for count data, namely the negative binomial model, in a generalized linear model (glm) framework. It performs the computations very efficiently and can handle complex experimental designs or additional explanatory variables. It also performs a normalisation step that accounts for differences in sequencing depth between samples and the behaviour of RNA-seq in specific situations, such as when a subset of genes is highly expressed in one of the experimental conditions but not another; this can lead to over-sampling of the highly-expressed genes in one condition leaving other genes relatively under-sampled. If this is not corrected for in the normalisation step, many of those other genes could appear down-regulated in the condition with the highly-expressed genes but this would be an artefact rather than biological changes. 

Loading the edgeR package in the current R session as we did at the start of the workshop gives us access to all the required functionality. R/Bioconductor packages wrap the detailed steps of their methods into individual functions, which will be available to R once the package has been loaded into the current session. From the user's perspective, the analysis is then performed by running a series of functions (steps) tailored to your particular dataset. All packages have documentation for their usage, and the edgeR package is particuarly well-documented with a detailed User Guide, which includes a number of case-study examples. 

As with all R packages, care must be taken to run all the steps in the right order and ensure any parts requiring manual set up are correct. In the context of gene expression analysis, a particularly important part to get right is the design matrix and ensuring the coefficient or contrast for the comparison(s) of interest are correctly extracted afterwards. This is what we will focus on in the next practical session.  For now, we will finish preparing our data ready for analysis. 

## Pre-processing steps

### Creating an edgeR data object

```{r}
## read data into edgeR object
y <- DGEList(counts=data, genes=row.names(data))
head(y$counts[,1:6]) # inspect the first 6 samples

## the counts per million (cpm) values have been computed and stored as well
head(cpm(y)[,1:6])
## they can be saved to a file if desired 
write.table(cpm(y), "Cancer_dataset_counts_per_million.txt", sep="\t", quote=F, row.names=T)
```

Our data is now stored in a DGEList object - this object class has been defined in the edgeR package to handle specific features of RNA-Seq data, and store the output of functions performed in the analysis process. It is a list object, which at this stage has 3 elements. These have pre-defined names and each component can be accessed using the $ symbol after the name of the DGEList object, **y**. 

```{r}
names(y)
head(y$counts)
```

Use the **head** command to inspect the other components of the object. What do you think they represent? 

### Filtering out low expressed genes

A common step in gene expression data analysis is to remove genes with zero or very low read counts. In a typical RNA-Seq dataset, many genes have zero counts in all samples and these cannot be analysed at all. They represent genes either unexpressed in the particular cell type and condition under study, or at such low levels such that they are not sampled in the sequencing process. If there are only a handful of reads for a given gene, it is also difficult to make any statistical inference about differential expression and so a threshold of >10 reads might be applied. 

Filtering can be performed as follows:

```{r}
## defining a filter to remove low expressed genes
## median depth is ~50m reads per sample
## raw count >10 corresponds to 0.2 cpm (counts per million) in this dataset
## smallest group size is 9
## retain genes with cpm>=0.2 in at least 9 samples
## note filter is independent of sample group (can be any 9 samples)


keep <- rowSums(cpm(y)> 0.2) >= 9
```

This line of code is rather clever and very efficient. It is used in the edgeR case studies* but it is not intuitive as to what it is doing. Usually the **rowSums** function will compute the sum of each row in a matrix. The inclusion of the >0.2 returns instead the number of elements (i.e. samples) in that row for which the condition is met. The final >=9 further changes the nature of the output returned by the function, and this time returns TRUE if the condition is met in 9 or more samples and FALSE otherwise.* 

*At least it used to be, there is now an in-built function called **filterByExpr**. You can explore how this works with:

```{r} 
help("filterByExpr")
```

You can see this behaviour by building up the command in stages:
```{r}
head(cpm(y)[,1:6]) # shows the CPM values for part of the dataset

head(rowSums(cpm(y))) # shows the sum of CPM values across each row for the first few rows

head(rowSums(cpm(y) > 0.2)) # shows the number of samples with values above 0.2 in each row

head(rowSums(cpm(y) > 0.2)) >= 9 # TRUE/FALSE vector for whether more than 9 samples in each row have CPM>0.2 or not
```

We can determine how many genes pass the filter using the ever-useful table function:

```{r}
table(keep)
```

So 18483 genes passed the filter, while 4885 did not. Relatively few genes are excluded with this filter, which may be linked to the high sequencing depth or the source of gene annotations (RefSeq vs Ensembl).

We now need to extract only the genes passing the filter from our data object
```{r}
y <- y[keep,]
dim(y) 
```

### Data normalisation

We can now proceed to normalising the data using edgeR's bespoke method, TMM, the **trimmed mean of M-values**. (M-values are otherwise known as the fold-changes, with the terminology originally coming from microarray data). The normalisation procedure finds a set of scaling factors that minimizes the logFCs between the samples for most genes - it assumes the majority of genes are not likely to be differentially expressed, which is reasonable for most studies.


```{r}
## calculate normalisation factors
y <- calcNormFactors(y)
class(y)

names(y)

head(y$counts[,1:6]) # header of the count matrix for the first 6 samples
head(y$samples) # sample details including library size i.e. total reads or depth and the normalisation factors

```

Notice how the normalisation factors in the object have been updated from their default values of 1 before the normalisation step was run. 

### Data exploration
edgeR also provides a visualisation of sample clustering with a multi-dimensional scaling plot. 
```{r}
plotMDS(y)

## refining the plot to label by cancer types rather than sample IDs

plotMDS(y, labels=conds, cex=0.6)
```

This plot shows strong clustering according to the cancer type, which suggests distinct transcriptional profiles and that we will find many signficantly differentially expressed genes in the analysis step. 

*Bear in mind that this exploratory plot is based on the overall gene expression profile - other datasets may not show strong clustering patterns but there may well still be a subset of genes significantly differentially expressed.*

The next section of the tutorial can be found here: 
https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2021/RNASeq_practical_part2_221121.html

If you leave your R session open, you can continue directly with Part 2, however if you accidentally close the R session you'll need to re-run the above code before moving on (this can be done quickly by loading the RMarkdown file for Part 1 into RStudio's editor panel and running each chunk of code again). 

Alternatively, to save the session and all objects/steps performed so far, click on the Session menu and 'Save Workspace As' and give it a name like RNASeq.RData - this will create an .RData file in your local working directory with your saved session (it is fairly large so you may want to delete it when you are completely finished). If you need to re-start R at any point, the workspace can be loaded with everything intact via Session -> Load Workspace

***