Data preprocessing
First, read in the raw count table:
counts <- read_tsv("GSE211066_raw_counts_GRCh38.p13_NCBI.tsv")
head(counts)
dim(counts)
The gene IDs are stored in column 1, followed by 12 columns of counts
And the sample information file:
s.info <- read_table("GSE211066_sample_info.txt")
s.info
check samples are in same order (manually for now)
colnames(counts)
[1] "GeneID" "GSM6448767" "GSM6448768" "GSM6448769"
[5] "GSM6448770" "GSM6448771" "GSM6448772" "GSM6448773"
[9] "GSM6448774" "GSM6448775" "GSM6448776" "GSM6448777"
[13] "GSM6448778"
We might also want to check the sequencing depth (number of millions of reads) for each sample.
read.depth <- apply(counts[,2:13], 2, sum)/1000000
summary(read.depth)
barplot(read.depth, las=2, cex.names=0.5)
Now, we create a factor to store the information about the experimental conditions and also 'batch' as this experiment involved 2 sets of samples being processed independently.
conds <- factor(s.info$Expt_condition)
batch <- factor(s.info$Expt_batch)
At this point, we start using the edgeR functionality to build our data object, a DGEList - this is a custom object to store all the relevant information including counts, sample IDs, gene IDs etc.
y <- DGEList(counts=counts[,2:13], genes=counts$GeneID)
names(y)
head(y$counts[,1:6]) # inspect the first 6 samples
The counts per million (cpm) values have been computed and stored as well:
head(cpm(y)[,1:6])
Filtering out low expressed genes
A common step in gene expression data analysis is to remove genes with zero or very low read counts. In a typical RNA-Seq dataset, many genes have zero counts in all samples and these cannot be analysed at all. They represent genes either unexpressed in the particular cell type and condition under study, or at such low levels such that they are not sampled in the sequencing process. If there are only a handful of reads for a given gene, it is also difficult to make any statistical inference about differential expression and so a threshold of >10 reads might be applied.
keep <- filterByExpr(y, group=conds, min.count=10)
Inspect the contents of keep
- how is information returned from the filterByExpr
command?
You can check with the following commands:
head(keep)
table(keep)
Reduce the data object to contain only genes that passed the filtering step:
y <- y[keep,]
dim(y)
We can now proceed to normalising the data using edgeR's bespoke method, TMM, the trimmed mean of M-values. (M-values are otherwise known as the fold-changes, with the terminology originally coming from microarray data). The normalisation procedure finds a set of scaling factors that minimizes the logFCs between the samples for most genes - it assumes the majority of genes are not likely to be differentially expressed, which is reasonable for most studies.
Calculate normalisation factors
y <- calcNormFactors(y)
names(y)
head(y$samples) # sample details including library size i.e. total reads or depth and the normalisation factors
A useful plot to look at how the samples cluster is the MDS plot (multi-dimensional scaling) - we can plot with different labels to understand how the data behaves and what the primary sources of variation are.
plotMDS(y)
plotMDS(y, labels=conds, cex=0.6)
plotMDS(y, labels=s.info$Expt_batch, cex=0.6)
plotMDS(y, labels=s.info$Sample_title, cex=0.6)
From this point, we are ready to design and run our analysis