RNA-Seq Analysis and Visualisation
Introduction
This document demonstrates RNA-Seq differential expression analysis, focusing on visualisation techniques like box plots, violin plots, and heatmaps. Exercises are embedded throughout to encourage active learning and exploration of the data.
Required Packages
Let's ensure the required packages are installed before running the analysis:
install.packages( 'tidyverse' )
install.packages( 'gplots' )
install.packages( 'readxl' )
If you have trouble installing these, you can try using BiocManager to install instead:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("tidyverse", "gplots", "readxl"))
Now let's load the packages:
library(tidyverse)
library(gplots)
library(readxl)
Data Loading and Preprocessing
Description of the Data and file structure
DGE results were obtained from the analysis of the transcriptional responses to starvation in HeLa cells treated with nutrient-deprived media (HBSS) for 4 h and control cells grown in nutrient-rich media. Two independent batches with triplicate samples of each biological condition are shown. RNA-seq libraries were prepared using the mRNA-seq protocol of the Crown Genomics Institute of the Nancy and Stephen Grand Israel National Center for Personalized Medicine, Weizmann Institute of Science. GSE211066 - RNA-seq Dataset
Load the Data
First, navigate to the Data Accessibility part of the research paper and once you open the link, click and download the S1 dataset. Load the dataset from an Excel file. Ensure the path to your data file is correct. Depending on the type of your download, you may need to rename the file to "Galves_procb_ESM_Table_S1.xlsx".
data <- readxl::read_xlsx(
"Galves_procb_ESM_Table_S1.xlsx",
sheet = "res.HBSS.vs.Full_siContol",
na = "NA"
)
Use your R skills to examine this data now. For example what are the columns, and what are the rows, and how big is the data frame?
Rename Columns and Filter for Significant Genes
Let's rename the first column for easier handling:
colnames(data)[1] <- "gene_id"
Now, let's use the data verbs to and filter for genes with padj < 0.05, and find those with the lowest P-values:
colnames(data)[1] <- "gene_id"
# Filter for significant genes
significant_genes <- (
data
%>% filter( padj < 0.05 )
)
# Select top genes with the highest log2FoldChange
top_genes <- (
significant_genes
%>% arrange( desc(abs(log2FoldChange)) )
%>% head(10)
)
Data Transformation for Visualisation
The data you've downloaded is in wide format: that is, the data for each gene is spread out along a single row of the data frame, with different columns being the expression values in different study components (which in this data are the different conditions and replicates).
For many purposes it will be better to have data in a long format, that is, where each row represents one combination of a gene and study arm, like this:
gene_id gene.sym condition_rep Expression
ENSG00000013441 CLK1 Round1_Full_1 94.6
ENSG00000013441 CLK1 Round1_Full_2 84.9
ENSG00000013441 CLK1 Round1_Full_3 85.8
(etc.)
and so on.
Changing between various formats of data layout like this is generally known as pivoting, and luckily tidyverse
has a special set of functions that make it easy to convert between them! Let's try this now. First, make a list of
columns we want to pivot:
expression_columns = c(
"Round1_Full_1", "Round1_Full_2", "Round1_Full_3",
"Round1_HBSS_1", "Round1_HBSS_2", "Round1_HBSS_3",
"Round2_Full_1", "Round2_Full_2", "Round2_Full_3",
"Round2_HBSS_1", "Round2_HBSS_2", "Round2_HBSS_3"
)
Alternatively, since all these column names start with 'Round', you could use:
expression_columns = grep( "Round", colnames( data ), value = T )
But of course you should check carefully that you're getting the right list of colums here!
Now let's pivot:
data_long <- (
data
%>% select(
c( "gene_id", "gene.sym", expression_columns )
)
%>% pivot_longer(
cols = expression_columns,
names_to = "Condition_Rep",
values_to = "Expression"
)
)
Have a look at this long form data now. Can you see what it's done? Does it have the number of rows you expected?
This data is ok but it would also be nice to split out the 'condition' and 'replicate' from that third column. Once
again, there is a helpful function in the stringr package (part of tidyverse) to do this - it's called
str_extract(). It's job is to extract bits of string
columns:
data_long = (
data_long
%>% mutate(
Condition = stringr::str_extract( data_long$Condition_Rep, "Full|HBSS" ),
Batch = stringr::str_extract( data_long$Condition_Rep, "Round1|Round2" )
)
)
How does this work?
Remember that mutate() is just the operation to 'add extra columns to the data frame'.
Here we are using it to add Condition and Batch columns.
To get their values, we are using str_extract() to extract data from the Condition_Rep columns.
The expressions like "Full|HBSS" are examples of regular expressions. Here the vertical bar | just means 'or',
so the code says "extract any text that looks like "Full" or "HBSS" (but nothing else).
You don't need to know regular expression syntax for this course, but they are a very useful way of parsing and extracting pieces of text.
Histogram
Below is an example script to create a histogram showing the distribution of expression values. This can help you understand the overall range and frequency of expression levels in the dataset by Condition.
p = (ggplot(data_long, aes(x = Expression, fill = Condition)) +
geom_histogram(binwidth = 0.5, color = "black", alpha = 0.7) +
scale_x_log10() +
theme_minimal() +
labs(
title = "Histogram of Gene Expression Levels by Condition",
x = "Expression Level (Log10 Scale)",
y = "Frequency"
)
)
print(p)

Exercises:
- Adjust the
binwidthin thegeom_histogram()function to see how it affects the distribution. - Try using different fill colors and adding transparency (
alpha) to the bars. - Compare the histogram of expression levels between different conditions (e.g.,
Fullvs.HBSS) usingfacet_wrapaesthetics for comparison.
Box Plot
Below is a script to create a basic box plot for the top 3 genes.
p = (
ggplot(
data = data_long %>% filter(gene.sym %in% top_genes$gene.sym[1:3] ),
mapping = aes(x = Condition, y = Expression, fill = Condition)
)
+ geom_boxplot()
+ scale_y_log10()
+ facet_wrap( ~ gene.sym )
+ theme_minimal()
+ labs(
title = "Box Plot of Expression Levels for Selected Genes",
y = "Expression Level (Log10 Scale)",
x = "Condition"
)
)
print(p)

Exercises:
- Add jittered data points (
geom_jitter()) to the box plot, ensuring they don’t overlap excessively. - Facet the box plot by both
Batchandgene.sym. Hint: Usefacet_grid. - Choose three genes with the smallest
padjvalues and plot their box plots. - Modify the y-axis scale to show raw values instead of a log scale.
Violin Plot
Below is a script for violin plots with log-scaled y-axis.
p = (
ggplot(
data_long %>% filter(gene.sym %in% top_genes$gene.sym[1:3]),
aes(x = Condition, y = Expression, fill = Condition))
+ geom_violin(trim = FALSE)
+ scale_y_log10()
+ facet_wrap(~ gene.sym)
+ theme_minimal()
+ labs(
title = "Violin Plot of Expression Levels for Selected Genes",
y = "Expression Level (Log10 Scale)",
x = "Condition"
)
)
print(p)

Can you:
- Add jittered data points to the violin plot?
- Use a custom color palette for the conditions, e.g.,
scale_fill_brewer(palette = "Set2")? - Recreate the violin plot for the same genes selected in the box plot exercise? And then compare the two visualisations.
- Export the plot using
ggsavefunction.
Heatmaps
Below is a heatmap script displaying the top 20 genes by padj with row clustering only, note - this time we are saving
it to pdf file instead of showing it in a plotting tab.
# Prepare data for Heatmap for the Top 20 Significant Genes
top_20_genes <- significant_genes %>%
arrange(padj) %>%
head(20)
# Select expression columns for the top 20 genes and convert to matrix
heatmap_data <- top_20_genes %>%
select(starts_with("Round")) %>%
as.matrix()
# Assign row names as gene symbols for readability
rownames(heatmap_data) <- top_20_genes$gene.sym
# Replace NAs with 0
heatmap_data[is.na(heatmap_data)] <- 0
# Define colors and other heatmap parameters
heatmap_colors <- colorRampPalette(c("blue", "white", "red"))(50)
# Define conditions for each column
col_conditions <- ifelse(grepl("Full", colnames(heatmap_data)), "red", "blue")
# Open a PDF device
pdf("20_DGE_heatmap.pdf")
# Generate Heatmap
heatmap.2(heatmap_data,
trace = "none",
scale = "row",
dendrogram = "both",
ColSideColors = col_conditions,
col = heatmap_colors,
main = "Top 20 Differentially Expressed Genes",
margins = c(8, 10),
cexRow = 0.8,
cexCol = 0.8)
dev.off()

Exercises:
- Modify the heatmap script to show both row and column clustering.
- Decrease the font of the main title of the plot.
- Adjust the color scale to use
"green"to"black"to"magenta". - Select the 20 genes with the highest absolute
log2FoldChangeand plot their heatmap with row clustering only. - Change the heatmap's color palette to a diverging palette, such as viridis or RdYlBu.
GO-Terms Enrichment Analysis
Explore online tools to identify enriched or depleted pathways. Use filtered gene lists (e.g., padj < 0.05 and absolute log2FoldChange > 2).
- DAVID: Functional annotation clustering, gene-term enrichment analysis, visualisation tools.
- PANTHER: GO term enrichment, pathway analysis, functional classification, overrepresentation tests.
- ShinyGO: Intuitive web interface for GO enrichment and visualisation, integrates other pathway data.