Data Preprocessing

In this practical, we will look at some real RNA-Seq data from the Cancer Genome Atlas project (https://cancergenome.nih.gov). We’ll use a dataset of 50 samples including 3 different cancer types (breast, kidney and endometrial) and we are interested to identify differentially expressed genes. Before we can do that, a number of preprocessing steps need to be performed - this tutorial will guide you through the steps of reading in data to R, dealing with formatting issues, then normalising and filtering the data.

Setup

All material can be found at the course link: https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020

  1. Create a new folder for today’s workshop on your computer

  2. Open a new R session by launching RStudio

  3. Set the working directory in RStudio to match the location of the your new folder

This can be done via the RStudio main menu toolbar at the top: Session > Set working directory > Choose working directory…

Navigate through your filesystem to the correct directory and click OK. A command setwd(“path_to_directory”) should automatically be executed in the R console - the bottom left panel in RStudio. You can also confirm the correct setting of the working directory with the getwd command:

getwd()

R will by default look in the current working directory for files to read in, and the location to write any output files we generate.

  • The easiest way to run this tutorial yourself in R is to download the version of this practical with the .rmd extension (rmarkdown file) to your working directory. If you then open this file in RStudio (click on the open file icon in the top left text editor panel), this will allow the chunks of code to be run directly when you reach them - the commands will be automatically executed in the console panel, and any output including plots will appear in the text editor panel.*
  • You can alternatively run any commands shown in this document by copy/pasting the R commands into the topleft panel of RStudio and clicking the icon with a green forward arrow in the toolbar. The line of code where the cursor is will be automatically copied into the bottom left panel and executed by R in realtime. You should see the R prompt > again when it has finished. Depending on the command, some output may also be printed to the screen or plot window. You can save the file afterwards to have a record of what you have done and can easily re-run again in future.

If you see an error message, first double check that the command is entered exactly as shown. If you can’t find the problem, please ask for help.

To continue, download the data files that we will be using by running the following two commands in R.

## file containing gene expression data
download.file("https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/data/Cancer_gene_expression_data.txt", "./Cancer_gene_expression_data.txt")

## file with sample information
download.file("https://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/data/Cancer_sample_info.txt", "./Cancer_sample_info.txt")

The “./name_of_file” part tells R to save the file in the current working directory and the two files should now be visible there on your computer.

The final preparation step is to install and load a Bioconductor package for performing RNA-Seq analysis. We will use edgeR and you can read more details about this package here. See here for an overview of Bioconductor itself.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("edgeR")

library(edgeR)

Reading data into R and preparing for analysis

Now, we can read in the gene expression data file and check its contents - there should be 23368 rows corresponding to genes, and 50 columns corresponding to samples.

data <- read.table("Cancer_gene_expression_data.txt", sep="\t", header=T, row.names=1)
dim(data)

head(data[,1:6]) # shows the first 6 rows (default for head command) for the first 6 samples, which are selected using the 1:6 in square brackets

These are raw counts from an RNA-Seq experiment. Next read in the sample details from the second file.

s.info <- read.table("Cancer_sample_info.txt", sep="\t", header=T)
dim(s.info)     
head(s.info)

NB In the latest version of R v4.0.3, strings (text columns) are read in as character vectors. In earlier versions of R, they were treated as factors by default, and an extra argument stringsAsFactors=F was required in the read.table command to over-ride this often unhelpful default behaviour.

At this point, it is critical to ensure that the samples are in the same order in the data object and the accompanying sample information as future steps will assign experimental conditions assuming this to be the case.

To do this, we can compare the column names of our data object with the Sample_ID column of s.info. First, we need to deal with some automatic formatting R performed internally when reading in the data - if you look back at the output of the head commands, you will notice that the column names of data have dots replacing the original dash symbols in Cancer_gene_expression_data.txt.

Indeed the original file still contains the dashes, it is only altered in the R environment to ensure the column names conform to R’s internal naming rules. Note this only applies to column names, and thus the dashes are also retained in column 1 of the s.info object. Another rule regarding column names is that they must begin with a letter.

To check if the two vectors match, we need to make the same dash to dot substitution in the sample IDs in the s.info object.

s.info$Sample_ID <- gsub("-", ".", s.info$Sample_ID)

## now check if the colnames of data and sample IDs in s.info match
identical(colnames(data), s.info$Sample_ID) 

Unfortunately it returns FALSE, indicating a discrepancy somewhere and we need to check what it might be. (If you are quite familiar with R, have a think about how this might be done).

From a quick inspection of the first few samples, they look to be in the same order:

head(colnames(data))
head(s.info$Sample_ID)

## we can use the R function setdiff to help track down the issue:
setdiff(colnames(data), s.info$Sample_ID)

One of the original sample IDs, 3Z-A93Z, started with a number, which is not permitted for column names in R and so it has automatically been prefixed with an X to create a valid name. Therefore, we need to modify the sample table again for this specific sample:

match("3Z.A93Z", s.info[,1])

This tells us it is element (row) 32 of the first column of s.info that needs modifying. This can be done by assigning it the sample name as it appears in the column names of data.

s.info[32, 1] <- "X3Z.A93Z"

## finally re-running the previous command confirms they are now identical
identical(colnames(data), s.info$Sample_ID) 

The formatting issues seen here are typical of many datasets where inconsistent naming patterns or special characters can cause problems for data handling and analysis. It cannot be emphasised enough how important it is to check these details carefully before embarking on analysis steps, as it is very easy to make inadvertent mistakes that could render the resuslts meaningless.

Now everything is correct, we can proceed with setting up the analysis. Let’s see what type of samples we have and how many of each.

table(s.info$Type)

BRCA indicates breast cancer samples, KIRC for kidney/renal and UCEC uterine/endometrial. Note that the total number of samples is 50, consistent with our R objects. Now we can define a factor for our experimental groups (cancer types).

conds <- factor(s.info$Type)

## inspect the new factor object
conds

Returning to the count data, we might also want to check the sequencing depth (number of millions of reads) for each sample.

read.depth <- apply(data, 2, sum)/1000000 
summary(read.depth)
barplot(read.depth, las=2, cex.names=0.5)

The median sequencing depth is ~47 million reads, but there is high variability between samples as illustrated in the barplot.

The apply function is a very efficient way to perform a particular task across columns of an object (indicated by the 2 in the command above) or rows (1), in this case computing the sum of all reads in each sample.


Performing differential expression analysis with edgeR

The edgeR package was developed by Gordon K Smyth and colleagues at the Walter and Eliza Hall Institute in Melbourne around 10 years ago and has been extended and improved ever since. The same group previously developed the widely-used ‘limma’ framework for microarray data and many of the concepts were applied to RNA-Seq data. edgeR implements statistical models suitable for count data, namely the negative binomial model. It performs the computations very efficiently and can handle complex experimental designs through a generalized linear model (glm) framework. It also performs a normalisation step that accounts for the behaviour of RNA-sequencing in specific situations, such as when a subset of genes are highly expressed in one of the experimental conditions but not another; this can lead to over-sampling of the highly-expressed genes in one condition leaving other genes relatively under-sampled. If this is not corrected for in the normalisation step, many of those other genes would appear artificially down-regulated in the condition with the highly-expressed genes.

Loading the edgeR package in the current R session as we did at the start of the workshop gives us access to all the required functionality. R/Bioconductor packages wrap the detailed steps of their methods into individual functions, which will be available to R once the package has been loaded into the current session. From the user’s perspective, the analysis is then performed by running a series of functions (steps) tailored to your particular dataset. All packages have documentation for their usage, and the edgeR package is particuarly well-documented with a detailed User Guide, which includes a number of case-study examples.

As with all R packages, care must be taken to run all the steps in the right order and ensure any parts requiring manual set up are correct. In the context of gene expression analysis, a particularly important part to get right is the design matrix and ensuring the coefficient or contrast for the comparison(s) of interest are correctly extracted afterwards. This is what we will focus on in the third practical session. For now, we will finish preparing our data ready for analysis.

Pre-processing steps

Creating an edgeR data object

## read data into edgeR object
y <- DGEList(counts=data, genes=row.names(data))
head(y$counts[,1:6]) # inspect the first 6 samples

## the counts per million (cpm) values have been computed and stored as well
head(cpm(y)[,1:6])
## they can be saved to a file if desired 
write.table(cpm(y), "Cancer_dataset_counts_per_million.txt", sep="\t", quote=F, row.names=T)

Our data is now stored in a DGEList object - this object class has been defined in the edgeR package to handle specific features of RNA-Seq data, and store the output of functions performed in the analysis process. It is a list object, which at this stage has 3 elements. These have pre-defined names and each component can be accessed using the $ symbol after the name of the DGEList object, y.

names(y)
head(y$counts)

Use the head command to inspect the other components of the object. What do you think they represent?

Filtering out low expressed genes

A common step in gene expression data analysis is to remove genes with zero or very low read counts. In a typical RNA-Seq dataset, many genes have zero counts in all samples and these cannot be analysed at all. They represent genes either unexpressed in the particular cell type and condition under study, or at such low levels such that they are not sampled in the sequencing process. If there are only a handful of reads for a given gene, it is also difficult to make any statistical inference about differential expression and so a threshold of >10 reads might be applied.

Filtering can be performed as follows:

## defining a filter to remove low expressed genes
## median depth is ~50m reads per sample
## raw count >10 corresponds to 0.2 cpm (counts per million) in this dataset
## smallest group size is 9
## retain genes with cpm>=0.2 in at least 9 samples
## note filter is independent of sample group (can be any 9 samples)


keep <- rowSums(cpm(y)> 0.2) >= 9

This line of code is rather clever and very efficient. It is used in the edgeR case studies* but it is not intuitive as to what it is doing. Usually the rowSums function will compute the sum of each row in a matrix. The inclusion of the >0.2 returns instead the number of elements (i.e. samples) in that row for which the condition is met. The final >=9 further changes the nature of the output returned by the function, and this time returns TRUE if the condition is met in 9 or more samples and FALSE otherwise.*

*At least it used to be, there is now an in-built function called filterByExpr. You can explore how this works with:

help("filterByExpr")

You can see this behaviour by building up the command in stages:

head(cpm(y)[,1:6]) # shows the CPM values for part of the dataset

head(rowSums(cpm(y))) # shows the sum of CPM values across each row for the first few rows

head(rowSums(cpm(y) > 0.2)) # shows the number of samples with values above 0.2 in each row

head(rowSums(cpm(y) > 0.2)) >= 9 # TRUE/FALSE vector for whether more than 9 samples in each row have CPM>0.2 or not

We can determine how many genes pass the filter using the ever-useful table function:

table(keep)

So 18483 genes passed the filter, while 4885 did not. Relatively few genes are excluded with this filter, which may be linked to the high sequencing depth or the source of gene annotations (RefSeq vs Ensembl).

We now need to extract only the genes passing the filter from our data object

y <- y[keep,]
dim(y) 

Data normalisation

We can now proceed to normalising the data using edgeR’s bespoke method, TMM, the trimmed mean of M-values. (M-values are otherwise known as the fold-changes, with the terminology originally used for microarray data).

## calculate normalisation factors
y <- calcNormFactors(y)
class(y)

names(y)

head(y$counts[,1:6]) # header of the count matrix for the first 6 samples
head(y$samples) # sample details including library size i.e. total reads or depth and the normalisation factors

Notice how the normalisation factors in the object have been updated from their default values of 1 before the normalisation step was run.

Data exploration

edgeR also provides a visualisation of sample clustering with a multi-dimensional scaling plot.

plotMDS(y)

## refining the plot to label by cancer types rather than sample IDs

plotMDS(y, labels=conds, cex=0.6)

This plot shows strong clustering according to the cancer type, which suggests we will find many signficantly differentially expressed genes in the analysis step.

Bear in mind that this exploratory plot is based on the overall gene expression profile and even if strong clustering is not observed for a particular dataset, there may well still be a subset signifcant genes to be found in the analysis.