After putting RNA-Seq data through a suitable pipeline, a gene count table will be generated with columns corresponding to samples and rows to genes. This is processed raw data and our starting point for analysis. It is crucial to start by understanding the characteristics of the data as it will inform the eventual analysis. It is useful to make a variety of plots assessing data quality and patterns, which can help identify any poor or failed samples that might need to be excluded as outliers, and to understand which factors have a strong influence on gene expression profiles.
The following exercise illustrates a few scenarios that might typically occur. Try running the example datasets and producing QC plots for them - a script has been provided for making the plots and the tutorial below will take you through the steps.
This series of exercises gives an opportunity to get hands-on experience identifying potential QC issues in differential expression projects. Most of these data sets are simulated and are simpler than real-world examples.
The QC exploration will be conducted in R and the following packages need to be installed:
If these have not already been installed, then as long as you have an internet connection they can be installed using the commands:
install.packages("ggplot2")
install.packages("reshape2")
I have also provided a script that creates pdf documents with a few simple QC plots. This will be the primary means of exploring these data sets. To load the script into your R session, use the command:
source("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/scripts/simple_RNA_QC.R")
For each example, create the simple QC plots and consider the following:
This example uses the cancer cell line data set that we will be analysing later today. We’ll go through this example in depth. The data comes in two parts:
In this set, there are three different cancer cell lines.First, load the data into R:
example1 <- read.table("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/data/Cancer_gene_expression_data.txt", header=T, row.names=1, sep="\t")
example1.info <- read.table("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/data/Cancer_sample_info.txt", header=T, sep="\t")
You can inspect the data in R if you wish. Now you can run the script that generates the plots:
simple.RNA.QC(example1, example1.info, "Example1")
This will create a pdf document called “Example1.pdf” in your working directory, which you should be able to open and inspect. If you are working in R Studio and prefer to look at the plots from there, omit the last argument:
simple.RNA.QC( example1, example1.info)
In this example, the count table does not include any special categories for reads so there are no values for ‘Unmapped’ in the bar chart. There is also quite a lot of variability in the number of reads per sample, but this is not unexpected in cancer cell line data. The heatmap shows that one sample, ‘A7.A3J0’, is more distant than the others in general, and might be considered an outlier, but given the inherent variability in cancer cell lines it might be a genuine difference rather than a technical failure. The two PCA plots reiterate this pattern, with A7.A3J0 being isolated in the plot. The second plot also shows that there seems to be some clustering based on group, although it is difficult to pick out due to this potential outlier dominating the PCA plot. Now, let’s remove the outlier and recreate the plots:
example1a <- example1[,-2]
example1a.info <- example1.info[-2,]
simple.RNA.QC(example1a, example1a.info, "Example1a")
Revisiting the plots with the potential outlier excluded, there is a much more noticeable pattern in the PCA. Although some samples are more distant than others, they the different cell lines seem to form lines in the plot which is the sort of pattern expected from cancer cell line data. Whether or not to include the possible outlier in any analysis is something of an open question. In non-cancer projects, it would be a clear candidate for exclusion based on how different it is, but here it is less certain. One possibility is to perform two analyses, one with the sample and one without, and see if there any significant differences in the gene lists produced.
This dataset is for a simulated gene knockout experiment with 2 groups.
Use Example 1 as the model to work from and check the QC plots for this data. The other examples will all use the same naming format as this one. We will discuss Examples 4, 8 and 12 in particular but you can try as many of the examples as you like. Here is how you would produce QC plots for Example 2:
example2 <- read.table("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/data/Example2_dataset.txt", header=T, row.names=1, sep="\t")
example2.info <- read.table("http://www.well.ox.ac.uk/bioinformatics/training/RNASeq_Nov2020/data/Example2_sample_info.txt", header=T, sep="\t")
simple.RNA.QC(example2, example2.info, "Example2")
Three groups, two with distinct treatments and one untreated.
Two groups, depleted and undepleted.
Two groups, normoxia and hypoxia.
Two groups, wildtype and knockout.
A two-factor model: low phosphate versus high phosphate, in untreated and treated samples, for four groups in total.
A two-factor model: treated versus untreated, in wildtype and knockout samples, for four groups in total.
Two types of treatment, with samples taken at 3 different time points, for 6 groups in total.
Two groups, treated and untreated.
Two groups, treated and untreated.
Treated and untreated samples, across three time points, for 6 groups in total.
Three different treatment groups.
Treated and untreated samples, across three time points, for 6 groups in total.
Two treatment groups.