For this practical, we will begin with a quick primer on how to use an HPC cluster, specifically the BMRC Cluster.
While previously you will have run bioinformatics tools (e.g. SAMtools) directly in the command line terminal,
you will now be submitting jobs to the Univa Grid Engine (UGE, formerly Sun Grid Engine), which will then send the jobs to be executed on compute nodes.
Software will be accessed via modules.
Let's get started!
Let's start by logging in again:
ssh yourusername@cluster1.bmrc.ox.ac.uk
If you're greeted with a long wall of text, congratulations, you're in! (aren't you glad you set this all up on Friday?)
You are now in your home directory (to see the full path use pwd).
Move to your working directory in /well/mscgenmed/, where more space is available. Type:
cd /well/mscgenmed/
A quick ls shows the two folders we are going to work with:
shared contains all the alignments, indexes and references we will all be using.
users contains your workspace (identified by your username), which only you have write access to. Let's go there now.
Go to your workspace and make sure you have write access by writing:
touch hello.txt
Before we start with the genomics, some required reading.
The information here will be useful for both session 1 (today) and session 3 (on Bulk RNA sequencing) of the module (you should definitely bookmark it).
Read up to (and including) the section about CLUSTER QUEUES AND NODES:
Using the BMRC Cluster Homepage
(Note: the graphic showing the BMRC cluster's architecture mentions rescomp1|rescomp2. Mentally replace these with cluster1.bmrc|cluster2.bmrc.)
We are now ready to get started with our genomics on the cluster!
As mentioned at the top, we will not be running software on the login node; to adopt good habits right from the start,
we will copy all the files we need for this section using compute nodes.
The shared folder contains a folder called family_trio, whose contents we would like to copy to our directory. This would typically be done by writing:
cp -r ../../shared/family_trio/ family_trio
Wait! Wait! We're not doing that, remember?
We are going to write the above within a BASH shell script and submit it as a job to the scheduler.
Create a new file copyscript.sh in vim
To do this, write the following:
vim copyscript.sh
and then press i to start editing, writing the following:
#!/bin/bash
#$ -cwd
cp -r ../../shared/family_trio family_trio
When you are done, hit the escape key, then type :wq (it will appear at the bottom of your screen); this tells vim to write to the aforementioned file and quit.
You will be back where you started but with a new script to run! (you can check with cat copyscript.sh)
In order to submit a job to the queue you will have to specify the type of queue using -q and the 'project' (the MSc) under which you're asking for resources using -P.
Once you have chosen the appropriate parameters, you are ready to submit your script. This is what it should look like:
qsub -q short.qc -P mscgenmed.prjc copyscript.sh
The scheduler will know to look for input relative to the current working directory and where to write output based on the #$ -cwd line we included in the script earlier.
Otherwise, it would assume the home directory as the starting point for relative paths, which we do not want.
Note that you can also set your scheduler parameters at the start of the script itself, like this:
#!/bin/bash
#$ -cwd
#$ -q short.qc
#$ -P mscgenmed.prjc
Run your first job now using the above command or its variant.
Once your script is running you should see a standard output file and a standard error file, each named after your script, followed by the letter o or e and the job's number.
Any text output that hasn't been redirected to a file will end up in the standard out file (though in this case there isn't any). If an error arises, it will go in the standard error file.
If the error file is not empty, an error has occurred; flag down one of the assistants.
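For example, after a job finishes you can inspect its output files like this (the touch line and job number 1234567 are purely illustrative, simulating the files the scheduler would create so the snippet can be dry-run anywhere):

```shell
# Simulate the files the scheduler creates (name: script + .o/.e + job ID).
touch copyscript.sh.o1234567 copyscript.sh.e1234567

ls copyscript.sh.[oe]*        # list both output files
cat copyscript.sh.e1234567    # no output here means no errors were reported
```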
Read the section on CHECKING OR DELETING RUNNING JOBS:
Using the BMRC Cluster Homepage
This will provide the information you need to check the status of your jobs (qstat) and to delete jobs (qdel).
Now that you have been shown how to write a script to copy some files over, write a similar script to run a BCFtools joint variant call on our family trio.
The compute cluster is a resource shared by many researchers, many of whom will use the same programs. It would therefore make no sense to install a separate copy of each program for every user.
This is why we have modules, which make centrally installed software available on demand. We shall now load BCFtools.
We can see all available modules using:
module avail (give it a second, there's a lot)
This is quickly overwhelming. A little trick to find the program you're after is to type:
module load
Then start typing the first letters of your program's name (e.g. BCF) and press tab twice to autocomplete and see the options.
(you'll also need to give this a second or two...)
Load the most up-to-date version of BCFtools. To make sure you did this properly, run:
module list
You can now call on BCFtools simply by writing:
bcftools
Does this mean you can use bcftools in all of your scripts now? Not so fast! BCFtools is loaded on the login node!
This will not affect what is loaded on a compute node, so you need to load your module from within your script!
Write a script as before, adding the BCFtools module and the following line:
bcftools mpileup -f $1 -b $2 | bcftools call -mv -Ov -o trio_calls.vcf
Here $1 and $2 are placeholders for the arguments you need to set:
family_trio/mini_reference/ref.fasta
bamlist.list (which you can create by running ls family_trio/*bam > bamlist.list)
The final submission command should look like this (assuming you placed the queue and project parameters inside the script):
qsub call_trios.sh family_trio/mini_reference/ref.fasta bamlist.list
Open the VCF file so that you're able to see the first few lines.
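If you'd rather not retype everything in vim, a script like call_trios.sh can also be assembled in one go with a heredoc. This is a sketch: the queue, project and module lines follow this section, but the exact BCFtools module name may differ on your cluster.

```shell
# Create call_trios.sh without opening an editor; quoting 'EOF' stops the
# shell from expanding $1 and $2 while writing the file.
cat > call_trios.sh <<'EOF'
#!/bin/bash
#$ -cwd
#$ -q short.qc
#$ -P mscgenmed.prjc

# Load BCFtools on the compute node (use the version you found via module avail).
module load BCFtools

# $1 = reference FASTA, $2 = file listing one BAM path per line.
bcftools mpileup -f "$1" -b "$2" | bcftools call -mv -Ov -o trio_calls.vcf
EOF
chmod +x call_trios.sh
```

You can confirm the result with cat call_trios.sh before submitting it with qsub as above.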
We will now move away from SAMtools, towards the gold-standard of large-scale NGS analysis projects, GATK.
Our (shared) goal is to call variants in 100 samples in such a way that we are able to combine these results and generate new genotyping results.
Each one of us has been assigned six BAM files (and their respective BAI index files).
Locate your files in exomes_gatk.
This time, you are not expected to transfer them, but to point to them. The output should also go to this folder for future merging.
Have a look at GATK's documentation for Germline short variant discovery in cohort data:
Main steps for Germline Cohort Data
For the remainder of this practical we will focus on the single-sample HaplotypeCaller step.
gatk --java-options "-Xmx2G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
HaplotypeCaller \
-R $1 \
-I $2 \
-O ${2%%.bam}.g.vcf.gz \
-ERC GVCF
Try inserting the code above in a script and running it.
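The -O filename in the command above uses bash parameter expansion to derive the output name from the input BAM. A quick runnable illustration:

```shell
# ${var%%.bam} strips the trailing .bam from the value of var,
# so sample01.bam becomes sample01, and we append .g.vcf.gz.
f="sample01.bam"
out="${f%%.bam}.g.vcf.gz"
echo "$out"   # sample01.g.vcf.gz
```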
You can put your script in a for loop so that it runs for each of the 6 samples.
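A sketch of that loop, written as a dry run so it can be tested anywhere: the demo_bams folder and call_gatk.sh script name are illustrative, and on the cluster you would replace the echo with the qsub line shown in the comment, globbing your real BAM folder.

```shell
# Simulate a folder of BAMs so this snippet can be dry-run anywhere.
mkdir -p demo_bams
touch demo_bams/sample1.bam demo_bams/sample2.bam

for bam in demo_bams/*.bam; do
    # On the cluster, this would submit one job per file, e.g.:
    # qsub -q short.qc -P mscgenmed.prjc call_gatk.sh ref.fasta "$bam"
    echo "would submit job for $bam"
done
```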
Alternatively you can set up an array of jobs as shown below.
You can check the progress of your GATK job in the standard error.
Have you noticed how the latter chromosomes seem to be traversed faster? Why is that? (hint: Nothing to do with GATK per se)
Read the section on ARRAY JOBS and DECLARING AN ARRAY JOB:
Using the BMRC Cluster Homepage
The following is what you'll need to add to your code.
#!/bin/bash
#$ -cwd
[...]
#$ -S /bin/bash (Add this line!)
[...]
array=(`ls ../../shared/exomes_gatk/username/*bam`) (Add this line to create an array of filenames from a list!)
bamfile=${array[$SGE_TASK_ID-1]} (Add this to select a file based on task ID)
gatk --java-options "-Xmx2G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" HaplotypeCaller \
[Replace $2 with bamfile in the body of this command]
When submitting an array job, add -t 1-6 to your parameters. This will launch 6 jobs, each running on one of your files.
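To see how the array lookup maps task IDs to files, here is the indexing simulated locally (on the cluster, SGE_TASK_ID is set by the scheduler, not by you, and the array comes from the ls glob in your script):

```shell
# Mock file list standing in for the ls glob in the real script.
array=(sample1.bam sample2.bam sample3.bam)

# Pretend the scheduler launched us as task 2 of the array job.
SGE_TASK_ID=2

# Bash arrays are 0-based while task IDs start at 1, hence the -1.
bamfile=${array[$SGE_TASK_ID-1]}
echo "$bamfile"   # sample2.bam
```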
Though we won't do it here, another way we can speed up this process is to break each job into intervals (for example, big chunks of a chromosome), since variants can be called in these regions independently.
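For illustration, splitting by chromosome would use GATK's -L (intervals) option. A dry-run sketch (echo prints the commands rather than running GATK; the reference, BAM and chromosome names are illustrative):

```shell
# One HaplotypeCaller invocation per chromosome; -L restricts calling
# to the given interval, so the jobs can run in parallel.
for chr in chr20 chr21 chr22; do
    echo gatk HaplotypeCaller \
        -R ref.fasta -I sample.bam \
        -L "$chr" \
        -O "sample.${chr}.g.vcf.gz" \
        -ERC GVCF
done
```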