Skip to main content

Identifying differentially expressed genes

We first create a design matrix, which captures the experimental design - there is one row per sample and the columns indicate experimntal groups. We can add covariates if needed but for now we will do a simple comparison of the nutrient-rich vs starved conditions (n=6 in each group). We make use of the conds factor that we created earlier.

design <- model.matrix(~conds)
design

By default, the levels of a factor are alphabetical and this in turn affects how the design matrix is created (look at column 2). This means it is using nutrient-rich group as the reference and the analysis will be performed as starved vs nutrient-rich. We might find it more intuitive to compare nutrient-rich to starved condition - so let's reorder the levels of the original factor:

conds <- factor(conds, levels=c("starved", "nutrient_rich"))

Inspect the conds factor by printing the object and make sure you can see how the order of the levels was swapped.

Now we need to recreate the design matrix to update it:

design <- model.matrix(~conds)
design
colnames(design) <- gsub("conds", "", colnames(design)) # removing 'conds' string from the column names
head(design)

We can see that each sample has a row in the design matrix, and column 2 indicates which samples belong to each experimental group: 1 when the sample belongs to that group, 0 otherwise.

Important! The rows of the design matrix are assumed by R to correspond to the order of columns in the counts table. Because our conds factor was created from the sample info file, we need to have ensured that both our starting files - the counts table and the sample info file have the samples in the same order. Otherwise we would run the analysis with samples likely incorectly assigned to their respective groups!

Estimating gene dispersions

The variability of a gene across samples is a crucial factor in determining the significance of a change in gene expression. edgeR computes a common dispersion estimate for all genes, a trended estimate by mean expression, and individual dispersion estimates for each gene. By default, edgeR will make use of the gene-specific estimates, which are now thought to be the most appropriate choice for assessing differential expression. They are also adjusted to make them more robust to the typically small sample size in gene expression studies. With small samples, the estimates of mean and variance are generally poorer and can, for example, lead to false positive results if the variance is under-estimated. Built into limma and edgeR is a procedure to 'borrow' information between genes and improve the robustness of the variance estimates by squeezing them towards a common value, according to how other genes with similar expression levels behave.

y <- estimateGLMCommonDisp(y, design, verbose=T) 
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

These estimates represent the biological variability in each gene. By default, the tagwise (individual gene estimates) are used in the analysis. The following plot visualises the dispersion estimates for all genes, with the common dispersion value indicated by the horizontal red line, the trended fit shown with the blue line, and the individual gene estimates are shown as black points.

plotBCV(y)

Fitting the model

fit <- glmFit(y, design)
res <- glmLRT(fit)

Inspect results ordered on p-value column:

head(res$table[order(res$table$PValue),])

Finally, we can use decideTests.DGELRT to summarise the results: By default, this summarises results at 5% FDR and tells us how many significantly up- and down-regulated genes there are. We can adjust the FDR level by adding the p.value argument.

de <- decideTests.DGELRT(res)

summary(de <- decideTests.DGELRT(res))
summary(de <- decideTests.DGELRT(res, p.value=0.01))

Congratulations! We have analysed some RNA-Seq data to find differentially expressed genes. Now it is time to explore the results a bit.

Check the paper to see how they defined differentially expressed genes and add a logFC threshold to decideTestsDGE. Do you find a similar number of DE genes? More? Fewer?

Start to investigate your results - you could look for specific genes mentioned in the paper or make a volcano plot of the significant genes for example - this plot shows the log fold-change on the x-axis against the -log10 (adjusted) p-value on the y-axis. Significant genes can be highlighted in a different colour e.g. red for up-regulated genes and blue for down-regulated genes.