The first step is to process the aligned RNA-Seq data. The bulk of the data-processing is performed by the QoRTs.jar java utility. This tool produces an array of output files, analyzing and tabulating the data in various ways. This utility requires about 10-20gb of RAM for most genomes, and takes roughly 4-7 minutes to process 1 million read-pairs.
    java -jar /path/to/jarfile/QoRTs.jar QC \
        mybamfile.bam \
        transcriptAnnotationFile.gtf.gz \
        /qc/data/dir/path/
In the above command (which must be entered as a single line, or with a trailing backslash at the end of each continued line), you must replace /path/to/jarfile/ with the file-path to the directory in which the jar file is kept. The path /qc/data/dir/path/ should be replaced with the path to which you want the QC data written. This should match the path given in the qc.data.dir column of the decoder for this sample-run.
The bam processing tool includes numerous options. A full description of these options can be found in the online documentation of the jar utility [6], or by entering the command:
java -jar /path/to/jarfile/QoRTs.jar QC --man
There are a number of crucial points that require attention when using the QoRTs.jar QC command.
The --minMAPQ parameter must be set to 50. For GSNAP, the MAPQ behavior is not well documented (or at least I have been unable to find such documentation), but a MAPQ filtering threshold of 30 appears to work.
For stranded data, the --stranded option must be used.
The --fr_secondStrand option may be required depending on the stranded library type. QoRTs does not attempt to automatically detect the platform and protocol used for stranded data. There are two types of strand-specific protocols, which are described by the TopHat/CuffLinks documentation at http://cufflinks.cbcb.umd.edu/manual.html#library as fr-firststrand and fr-secondstrand. In HTSeq, these same library type options are defined as -s reverse and -s yes respectively. According to the CuffLinks manual, fr-firststrand (the default used by QoRTs for stranded data) applies to dUTP, NSR, and NNSR protocols, whereas fr-secondstrand applies to "Directional Illumina (ligation)" and "Standard SOLiD" protocols. If you are unsure which library type applies to your dataset, don't worry: one of the tests will report the stranded library type. If you use this test to determine the library type, be aware that you may have to re-run QoRTs with the correct library type set.
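If a bam file contains multiple read-groups, each read-group should be analysed separately using the --readGroup option. This will prevent run- or lane-specific biases, artifacts, or errors from being obfuscated.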
For single-end data, the --singleEnded option must be used.
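The points above can be collected into a small helper that picks the flags for one sample. This is only a sketch under stated assumptions: the function name qorts_flags and its three arguments are hypothetical (they are not part of QoRTs), it assumes --fr_secondStrand is given in addition to --stranded, and it emits only flags named in this section. It prints the flags rather than launching java.

```shell
#!/bin/sh
# Hypothetical helper (not part of QoRTs): print the QC flags for one sample.
#   $1: "stranded" or "unstranded"
#   $2: library type, "fr-firststrand" or "fr-secondstrand"
#       (ignored for unstranded data)
#   $3: "paired" or "single"
qorts_flags() {
    flags=""
    if [ "$1" = "stranded" ]; then
        flags="--stranded"
        # fr-firststrand is the QoRTs default for stranded data, so only
        # fr-secondstrand needs an extra flag.
        if [ "$2" = "fr-secondstrand" ]; then
            flags="$flags --fr_secondStrand"
        fi
    fi
    if [ "$3" = "single" ]; then
        # Append with a separating space only if flags is non-empty.
        flags="${flags:+$flags }--singleEnded"
    fi
    printf '%s\n' "$flags"
}

# Dry run: show the flags for two hypothetical samples.
qorts_flags stranded fr-secondstrand paired
qorts_flags unstranded - single
```

The MAPQ threshold and read-group name are left out on purpose, since they depend on the aligner and on how the bam files were produced.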
For example, to read the first read group bam-file for SAMP1 from the example dataset (which is stranded, coordinate-sorted, and uses the fr_firstStrand stranded library type), one would use the following command:
    java -jar /path/to/jarfile/QoRTs.jar QC \
        --stranded \
        inputData/bamFiles/SAMP1_RG1.bam \
        inputData/annoFiles/anno.gtf.gz \
        outputData/qortsData/SAMP1_RG1/
This command must be run on each bam file (and possibly more than once on each, if each bam file consists of multiple separate read-groups).
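Because the command has to be repeated for every bam file, a shell loop is convenient. The following is a minimal sketch, assuming the directory layout of the example dataset and a hypothetical list of sample-run names: only SAMP1_RG1 appears in the text above, and SAMP2_RG1 is a placeholder. The helper name qorts_qc_cmd is also hypothetical. The script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: generate one QoRTs QC command per bam file (dry run).
JAR=/path/to/jarfile/QoRTs.jar           # replace with the real location
ANNO=inputData/annoFiles/anno.gtf.gz

# Hypothetical helper: print the QC command for one sample-run name.
# Assumes the bam file is inputData/bamFiles/<name>.bam and that output
# goes to outputData/qortsData/<name>/.
qorts_qc_cmd() {
    printf 'java -jar %s QC --stranded inputData/bamFiles/%s.bam %s outputData/qortsData/%s/\n' \
        "$JAR" "$1" "$ANNO" "$1"
}

# SAMP1_RG1 is from the example above; SAMP2_RG1 is a placeholder name.
for run in SAMP1_RG1 SAMP2_RG1; do
    qorts_qc_cmd "$run"
done
```

Piping the printed lines to sh would execute them; reviewing the output first is a cheap way to catch a wrong path before a long run.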
Memory usage: The QoRTs QC utility requires at least 4gb of RAM for most genomes / datasets. Larger genomes, genomes with more annotated genes/transcripts, or larger bam files may require more RAM. You can set the maximum amount of RAM allocated to the JVM using the -Xmx option (for example, -Xmx4000M). This should be included before the -jar in the command line. For example:
    # Set the maximum to the minimum recommended 4 gigabytes:
    java -Xmx4000M -jar /path/to/jarfile/QoRTs.jar QC \
        --stranded \
        inputData/bamFiles/SAMP1_RG1.bam \
        inputData/annoFiles/anno.gtf.gz \
        outputData/qortsData/SAMP1_RG1/

    # Or set the maximum to 16 gigabytes:
    java -Xmx16G -jar /path/to/jarfile/QoRTs.jar QC \
        --stranded \
        inputData/bamFiles/SAMP1_RG1.bam \
        inputData/annoFiles/anno.gtf.gz \
        outputData/qortsData/SAMP1_RG1/
This option can be used with any and all of the QoRTs java utilities.
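Since the heap limit applies to every QoRTs sub-command, it can be convenient to set it in one place. The sketch below assumes a hypothetical environment variable QORTS_MEM and a hypothetical wrapper function qorts_java_cmd; neither is part of QoRTs. It prints the resulting command line rather than running it.

```shell
#!/bin/sh
# Hypothetical setting: heap limit passed to -Xmx, defaulting to 4000M.
QORTS_MEM="${QORTS_MEM:-4000M}"

# Hypothetical wrapper: print a QoRTs command line with the heap limit
# placed before -jar, followed by whatever arguments were given.
qorts_java_cmd() {
    printf 'java -Xmx%s -jar /path/to/jarfile/QoRTs.jar' "$QORTS_MEM"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Dry run: show the command that would display the QC manual.
qorts_java_cmd QC --man
```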