Processing of aligned RNA-Seq data

The first step is to process the aligned RNA-Seq data. The bulk of the data processing is performed by the QoRTs.jar Java utility, which produces an array of output files that analyze and tabulate the data in various ways. This utility requires about 10-20 GB of RAM for most genomes, and takes roughly 4-7 minutes to process 1 million read-pairs.

java -jar /path/to/jarfile/QoRTs.jar QC \
                   mybamfile.bam \
                   transcriptAnnotationFile.gtf.gz \
                   /qc/data/dir/path/

In the above command (a single command; the trailing backslashes continue it across lines), you must replace /path/to/jarfile/ with the path to the directory in which the jar file is kept. The path /qc/data/dir/path/ should be replaced with the directory to which you want the QC data written. This should match the path given in the qc.data.dir column of the decoder for this sample-run.

The bam processing tool includes numerous options. A full description of these options can be found in the online documentation of the jar utility⁶, or by entering the command:

java -jar /path/to/jarfile/QoRTs.jar QC --man

There are a number of crucial points that require attention when using the QoRTs.jar QC command.

MAPQ filter: Most aligners designed for RNA-Seq use the MAPQ field to define whether or not a read is multi-mapped. However, different aligners use different conventions. By default QoRTs assumes the RNA-STAR convention in which a MAPQ of 255 indicates a unique alignment. For TopHat, the --minMAPQ parameter must be set to 50. For GSNAP, the MAPQ behavior is not well documented (or at least I have been unable to find such documentation), but a MAPQ filtering threshold of 30 appears to work.
Stranded Data: By default, QoRTs assumes that the data is NOT strand-specific. For strand-specific data, the --stranded option must be used.
Stranded Library Type: The --fr_secondStrand option may be required depending on the stranded library type. QoRTs does not attempt to automatically detect the platform and protocol used for stranded data. There are two types of strand-specific protocols, which are described by the TopHat/CuffLinks documentation at http://cufflinks.cbcb.umd.edu/manual.html#library as fr-firststrand and fr-secondstrand. In HTSeq, these same library type options are defined as -s reverse and -s yes respectively. According to the CuffLinks manual, fr-firststrand (the default used by QoRTs for stranded data) applies to dUTP, NSR, and NNSR protocols, whereas fr-secondstrand applies to "Directional Illumina (ligation)" and "Standard SOLiD" protocols. If you are unsure which library type applies to your dataset, don't worry: one of the tests will report stranded library type. If you use this test to determine library type, be aware that you may have to re-run QoRTs with the correct library type set.
Read Groups: Depending on the production pipeline, each biological sample may be run across multiple sequencer lanes. These separate files can be merged together either before or after analysis with QoRTs (and maybe even before alignment). However, if the merger occurs before analysis with QoRTs, then each bam file will consist of multiple separate lanes or runs. In this case, it is STRONGLY recommended that separate QC runs be performed on each "read group", using the --readGroup option. This will prevent run- or lane-specific biases, artifacts, or errors from being obscured.
Read Sorting: For paired-end data, reads should be sorted. By default, QoRTs accepts files sorted by name OR by position. Technically QoRTs will also accept unsorted data, but memory usage will be greatly increased.
Single-end vs paired-end: By default, QoRTs assumes the input bam file consists of paired-end data. For single-end data, the --singleEnded option must be used.
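The options above can be combined on one command line. As a rough sketch of how the choices map to flags, the following shell function prints the flags for a given aligner, library type, and layout. The function name qorts_qc_flags and its argument vocabulary (star/tophat/gsnap, and so on) are hypothetical and only serve the illustration; the flags themselves are the ones named above, and the GSNAP threshold of 30 is only the value that "appears to work".

```shell
#!/bin/sh
# Sketch: derive QoRTs QC flags from a sample's properties.
# The function name and argument values are hypothetical helpers;
# only the flags come from the points listed above.
qorts_qc_flags() {
    aligner="$1"   # star | tophat | gsnap
    strand="$2"    # unstranded | fr_firstStrand | fr_secondStrand
    layout="$3"    # paired | single
    flags=""

    case "$aligner" in
        star)   ;;                                # default MAPQ convention
        tophat) flags="$flags --minMAPQ 50" ;;
        gsnap)  flags="$flags --minMAPQ 30" ;;    # reported to work, not documented
        *) echo "unknown aligner: $aligner" >&2; return 1 ;;
    esac

    case "$strand" in
        unstranded)      ;;                       # default: not strand-specific
        fr_firstStrand)  flags="$flags --stranded" ;;
        fr_secondStrand) flags="$flags --stranded --fr_secondStrand" ;;
        *) echo "unknown library type: $strand" >&2; return 1 ;;
    esac

    case "$layout" in
        paired) ;;                                # default: paired-end
        single) flags="$flags --singleEnded" ;;
        *) echo "unknown layout: $layout" >&2; return 1 ;;
    esac

    # Drop the leading space left over from concatenation.
    echo "${flags# }"
}

# Example: TopHat-aligned, fr-secondstrand, single-end data.
qorts_qc_flags tophat fr_secondStrand single
```

The printed flags would be placed between QC and the bam file name in the command shown earlier.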

For example, to read the first read group bam-file for SAMP1 from the example dataset (which is stranded, coordinate-sorted, and uses the fr_firstStrand stranded library type), one would use the following command:

java -jar /path/to/jarfile/QoRTs.jar QC \
                   --stranded \
                   inputData/bamFiles/SAMP1_RG1.bam \
                   inputData/annoFiles/anno.gtf.gz \
                   outputData/qortsData/SAMP1_RG1/

This command must be run on each bam file (and possibly more than once on each, if each bam file consists of multiple separate read-groups).
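When bam files contain several merged read groups, the per-file, per-read-group runs can be scripted. The sketch below prints (rather than runs) one QC command per read group found in each bam header. It makes several assumptions that are not part of QoRTs: samtools is installed and used to dump the header, the merged files live in a hypothetical mergedBams/ directory, file names contain no spaces, and output goes to a hypothetical <sample>_<readgroup>/ directory layout. The helper names read_group_ids and qc_command are likewise made up for this example.

```shell
#!/bin/sh
# Sketch: print one QoRTs QC command per read group in each bam file.
# Paths, helper names, and the output layout are illustrative assumptions.
QORTS_JAR="${QORTS_JAR:-/path/to/jarfile/QoRTs.jar}"

# Read SAM header text on stdin; print the ID field of every @RG line.
read_group_ids() {
    awk -F '\t' '/^@RG/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^ID:/) print substr($i, 4)
    }'
}

# Print (do not run) the QC command for one bam file and one read group.
qc_command() {
    bam="$1"; gtf="$2"; outroot="$3"; rg="$4"
    sample=$(basename "$bam" .bam)
    echo "java -jar $QORTS_JAR QC --stranded --readGroup $rg $bam $gtf $outroot/${sample}_$rg/"
}

for bam in mergedBams/*.bam; do
    [ -e "$bam" ] || continue    # the glob matched nothing
    # samtools is an assumption here; any tool that prints the header works.
    samtools view -H "$bam" | read_group_ids | while read -r rg; do
        qc_command "$bam" inputData/annoFiles/anno.gtf.gz outputData/qortsData "$rg"
    done
done
```

Printing the commands first makes it easy to review them before execution; once they look right, the output can be piped to sh or submitted to a job scheduler. The --stranded flag is included only because the example dataset is stranded and should be dropped for unstranded data.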

Memory Usage

The QoRTs QC utility requires at least 4 GB of RAM for most genomes / datasets. Larger genomes, genomes with more annotated genes/transcripts, or larger bam files may require more RAM. You can set the maximum amount of RAM allocated to the JVM using the -Xmx option (for example, -Xmx4000M). This option must appear before -jar on the command line. For example:

#Set the maximum to the minimum recommended 4 gigabytes:
java -Xmx4000M -jar /path/to/jarfile/QoRTs.jar QC \
                   --stranded \
                   inputData/bamFiles/SAMP1_RG1.bam \
                   inputData/annoFiles/anno.gtf.gz \
                   outputData/qortsData/SAMP1_RG1/

#Or set the maximum to 16 gigabytes:
java -Xmx16G -jar /path/to/jarfile/QoRTs.jar QC \
                   --stranded \
                   inputData/bamFiles/SAMP1_RG1.bam \
                   inputData/annoFiles/anno.gtf.gz \
                   outputData/qortsData/SAMP1_RG1/

This option can be used with any and all of the QoRTs java utilities.


Dr Stephen William Hartley 2016-01-28