RNA Workshop Session 3 Practical

This series of exercises gives an opportunity to get hands-on experience identifying potential QC issues in differential expression projects. Most of these data sets are simulated and are simpler than real-world examples. The QC exploration will be conducted in R. It's not required to complete all of these examples today, or indeed ever; rather enough examples are included to try to ensure there will be enough for you to get the hang of it, even if you want to continue after today. The key thing to bear in mind is that these exercises are lateral-thinking puzzles - working out if something has gone wrong, and what might have caused it.

Software Required

Within R, the following packages need to be installed:

ggplot2
reshape2

If these have not already been installed, then as long as you have an internet connection they can be installed using the commands:

install.packages("ggplot2")  
install.packages("reshape2")

I have also provided a script that creates pdf documents with a few simple QC plots. This will be the primary means of exploring these data sets. To load the script into your R session, use the command:

source("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day1_261118/Session3/practical/simple_RNA_QC.R")

Tasks

For each example, create the simple QC plots and consider the following:

What potential problems can you identify from the plots?
What could have caused those problems?
What additional information regarding the project would you need to confirm those causes?
What would you do to account for any QC issues when analysing the data?

Example 1

This example uses a cancer cell line data set that will be revisited in tomorrow's practical session. We'll go through this example in depth.

Firstly, open up the data file in your browser or text editor:

http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day1_261118/Session3/practical/Cancer_gene_expression_dataset.txt

You'll see a table of counts, in the format produced by a typical sequencing pipeline. Each row represents one gene, each column one sample. Notice that although this file contains all of the count data, it does not specify which samples belong to which group, or any other sample meta-data - this has to be stored separately.

In this example, that data is stored in another text file:

http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day1_261118/Session3/practical/Cancer_sample.info.txt

In this set, there are three different cancer cell lines. This file could contains any other information of interest about the samples. Tomorrow, when we look at analysing RNA sequencing data, these are the sorts of factors that can be included in the analysis.

First, we need to load the data into R. Start R and use the following commands:

example1 <- read.table("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day1_261118/Session3/practical/Cancer_gene_expression_dataset.txt", header=T, row.names=1, sep="\t")
example1.info <- read.table("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2018/Day1_261118/Session3/practical/Cancer_sample.info.txt", header=T, sep="\t")

You can inspect the data in R if you wish. Now you can run the script that generates the plots:

simple.RNA.QC(example1, example1.info, "Example1")

This will create a pdf document called 'Example1.pdf' you should be able to open and inspect. If you are working in R Studio and prefer to look at the plots from there, omit the last argument:

simple.RNA.QC( example1, example1.info )

Note that in RStudio you can move back and forth between all the plots you have created in a session using the forward and back buttons in the plot window. After running the code, only the last of the plots will be visible but the others will still be there. If you run this second form in R instead of RStudio, all except the last plot will be lost, so use the pdf version above if you aren't using RStudio.

In this example, the count table does not include any special categories for reads so there are no values for 'Unmapped' in the bar chart. There is also quite a lot of variability in the number of reads per sample, but this is not unexpected in cancer cell line data. The heatmap shows that one sample, 'A7.A3J0', is more distant than the others in general, and might be considered an outlier, but given the inherent variability in cancer cell lines it might be a genuine difference rather than a technical failure. The two PCA plots reiterate this pattern, with A7.A3J0 being isolated in the plot. The second plot also shows that there seem to be some clustering based on group, although it is difficult to pick out due to this potential outlier dominating the PCA plot.

Now, let's remove the outlier and recreate the plots:

example1a <- example1[,-2] 
example1a.info <- example1.info[-2,]
simple.RNA.QC( example1a, example1a.info, "Example1a" )

Revisiting the plots with the potential outlier excluded, there is a much more noticeable pattern in the PCA. Although some samples are more distant than others, the different cell lines seem to cluster with respect to one axis which is the sort of pattern expected from cancer cell line data. Whether or not to include the possible outlier in any analysis is something of an open question. In non-cancer projects, it would be a clear candidate for exclusion based on how different it is, but here it is less certain. One possibility is to perform two analyses, one with the sample and one without, and see if there any significant differences in the gene lists produced.

Example 2

This dataset is for a simulated gene knockout experiment with 2 groups. The two files are in the same folder but have the names:

Example2_dataset.txt
Example2_sample_info.txt

Use Example 1 as the model to work from and check the QC plots for this data. The other examples will all use the same naming format as this one.

Example 3

Three groups, two with distinct treatments and one untreated. The same naming convention as for Example 2 is used.

Example 4

Two groups, depleted and undepleted.

Example 5

Two groups, normoxia and hypoxia.

Example 6

Two groups, wildtype and knockout.

Example 7

A two-factor model: low phosphate versus high phosphate, in untreated and treated samples, for four groups in total.

Example 8

A two-factor model: treated versus untreated, in wildtype and knockout samples, for four groups in total.

Example 9

Two types of treatment, with samples taken at 3 different time points, for 6 groups in total.

Example 10

Two groups, treated and untreated.

Example 11

Two groups, treated and untreated.

Example 12

Treated and untreated samples, across three time points, for 6 groups in total.

Example 13

Three different treatment groups.

Example 14

Treated and untreated samples, across three time points, for 6 groups in total.

Example 15

Two treatment groups.