Dataset Organization

Several QoRTs functions require "decoder" information, which describes each sample and all of its technical replicates (if any). The simplest method is to provide a decoder file. All of the columns are optional except for unique.ID. However, if group, lane, or technical replicate information is not supplied, then QoRTs will not be able to produce plots that are organized or grouped by these factors.

Fields:

unique.ID: A unique identifier for the row. QoRTs will also accept the synonym "lanebam.ID". THIS IS THE ONLY MANDATORY FIELD.
lane.ID: The ID of the lane or batch. By default this will be set to "UNKNOWN".
group.ID: The ID of the "group". For example: "Case" or "Control". By default this will be set to "UNKNOWN".
sample.ID: The ID of the biological sample from which the data originated. Each sample can have multiple rows, representing technical replicates (in which the same sample is sequenced on multiple lanes or runs). By default QoRTs will assume that every row comes from a separate sample, and will thus set the sample.ID to equal the unique.ID.
qc.data.dir: The directory in which the Java utility saved all of the output data. If this column does not exist, it will be set to the unique.ID by default.
input.read.pair.count: (Optional) The number of reads in the original fastq file, prior to alignment. If this field is left out, then QoRTs will skip comparisons and plotting of mapping rates. There are a number of other ways to input this value. See Section 8.4.21.
multi.mapped.read.pair.count: (Optional) The number of reads that were multi-mapped by the aligner. This field should only be used if filtering for multi-mapped reads is performed prior to analysis with QoRTs (which is not recommended). Even in this case, this field can simply be left out and QoRTs will skip plotting and comparisons of multi-mapping rates. See Section 8.4.21.

In addition, the decoder can contain any additional columns, as long as all of the column names are distinct.
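As a rough illustration of the fields above, the following base-R sketch builds a small decoder with technical replicates and writes it to a file. The IDs, the temporary file path, and the tab-delimited layout with a header row are assumptions made for this example; the sketch does not call QoRTs itself.

```r
# Hypothetical decoder: two biological samples, each sequenced on two lanes.
decoder <- data.frame(
  unique.ID = c("S1_L1", "S1_L2", "S2_L1", "S2_L2"),
  sample.ID = c("S1", "S1", "S2", "S2"),
  lane.ID   = c("L1", "L2", "L1", "L2"),
  group.ID  = c("CASE", "CASE", "CONTROL", "CONTROL"),
  stringsAsFactors = FALSE
)

# Column names must be distinct, and unique.ID must identify each row.
stopifnot(!anyDuplicated(names(decoder)), !anyDuplicated(decoder$unique.ID))

# Write a tab-delimited file with a header row (assumed layout), then read it back.
decoder.path <- tempfile(fileext = ".txt")
write.table(decoder, decoder.path, sep = "\t", quote = FALSE, row.names = FALSE)
decoder.roundtrip <- read.table(decoder.path, header = TRUE, sep = "\t",
                                stringsAsFactors = FALSE)
```

Here the two rows that share a sample.ID are technical replicates of the same biological sample, while unique.ID stays distinct on every row.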

While QoRTs is primarily designed to allow comparisons between biological groups, lanes, sequencing runs, etc., it can also be used on simpler datasets, or even on individual samples. Thus, only the unique.ID field is required. For testing purposes, you can produce a completed decoder (with all default values filled in) using the completeAndCheckDecoder function.

The simplest example is just a character vector of unique.ID values:

completeAndCheckDecoder(c("SAMPLE1","SAMPLE2","SAMPLE3"));

##   unique.ID sample.ID lane.ID group.ID qc.data.dir
## 1   SAMPLE1   SAMPLE1 UNKNOWN  UNKNOWN     SAMPLE1
## 2   SAMPLE2   SAMPLE2 UNKNOWN  UNKNOWN     SAMPLE2
## 3   SAMPLE3   SAMPLE3 UNKNOWN  UNKNOWN     SAMPLE3

Alternatively, any of the optional fields can be included or left out, as desired:

incompleteDecoder <- data.frame(unique.ID = c("SAMPLE1", "SAMPLE2"),
                                group.ID = c("CASE", "CONTROL"));
completeAndCheckDecoder(incompleteDecoder);

##   unique.ID group.ID sample.ID lane.ID qc.data.dir
## 1   SAMPLE1     CASE   SAMPLE1 UNKNOWN     SAMPLE1
## 2   SAMPLE2  CONTROL   SAMPLE2 UNKNOWN     SAMPLE2
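The default rules listed under Fields can be mimicked in plain R, which may help when checking a decoder without QoRTs loaded. The function below, fillDecoderDefaults, is a hypothetical helper written for this illustration. It is not the implementation of completeAndCheckDecoder, and it covers only the defaults stated in this section.

```r
# Hypothetical helper, NOT part of QoRTs: fills in the documented defaults.
fillDecoderDefaults <- function(decoder) {
  # Accept a bare character vector of IDs as well as a data.frame.
  if (is.character(decoder)) {
    decoder <- data.frame(unique.ID = decoder, stringsAsFactors = FALSE)
  }
  # Honour the documented synonym for the mandatory column.
  names(decoder)[names(decoder) == "lanebam.ID"] <- "unique.ID"
  stopifnot("unique.ID" %in% names(decoder),
            !anyDuplicated(names(decoder)),
            !anyDuplicated(decoder$unique.ID))
  ids <- as.character(decoder$unique.ID)
  # Documented defaults: sample.ID and qc.data.dir fall back to unique.ID,
  # lane.ID and group.ID fall back to "UNKNOWN".
  defaults <- list(sample.ID = ids, lane.ID = "UNKNOWN",
                   group.ID = "UNKNOWN", qc.data.dir = ids)
  for (field in names(defaults)) {
    if (!field %in% names(decoder)) decoder[[field]] <- defaults[[field]]
  }
  decoder
}

filled <- fillDecoderDefaults(c("SAMPLE1", "SAMPLE2", "SAMPLE3"))
partial <- fillDecoderDefaults(
  data.frame(unique.ID = c("SAMPLE1", "SAMPLE2"),
             group.ID = c("CASE", "CONTROL"),
             stringsAsFactors = FALSE))
```

Columns that are already present are left untouched, so a supplied group.ID is kept while the missing fields receive their defaults.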

Example data

The separate R package QoRTsExampleData contains an example dataset with an example decoder:

directory <- paste0(system.file("extdata/", package="QoRTsExampleData", mustWork=TRUE), "/");
decoder.file <- system.file("extdata/decoder.txt", package="QoRTsExampleData", mustWork=TRUE);
decoder.data <- read.table(decoder.file, header=TRUE, stringsAsFactors=FALSE);
print(decoder.data);

##    sample.ID lane.ID unique.ID  qc.data.dir group.ID input.read.pair.count
## 1      SAMP1      L1 SAMP1_RG1 ex/SAMP1_RG1     CASE                465298
## 2      SAMP1      L2 SAMP1_RG2 ex/SAMP1_RG2     CASE                472241
## 3      SAMP1      L3 SAMP1_RG3 ex/SAMP1_RG3     CASE                500691
## 4      SAMP2      L1 SAMP2_RG1 ex/SAMP2_RG1     CASE                461405
## 5      SAMP2      L2 SAMP2_RG2 ex/SAMP2_RG2     CASE                467713
## 6      SAMP2      L3 SAMP2_RG3 ex/SAMP2_RG3     CASE                492322
## 7      SAMP3      L1 SAMP3_RG1 ex/SAMP3_RG1     CASE                485397
## 8      SAMP3      L2 SAMP3_RG2 ex/SAMP3_RG2     CASE                489859
## 9      SAMP3      L3 SAMP3_RG3 ex/SAMP3_RG3     CASE                516906
## 10     SAMP4      L1 SAMP4_RG1 ex/SAMP4_RG1     CTRL                460968
## 11     SAMP4      L2 SAMP4_RG2 ex/SAMP4_RG2     CTRL                468391
## 12     SAMP4      L3 SAMP4_RG3 ex/SAMP4_RG3     CTRL                484530
## 13     SAMP5      L1 SAMP5_RG1 ex/SAMP5_RG1     CTRL                469884
## 14     SAMP5      L2 SAMP5_RG2 ex/SAMP5_RG2     CTRL                475001
## 15     SAMP5      L3 SAMP5_RG3 ex/SAMP5_RG3     CTRL                494213
## 16     SAMP6      L1 SAMP6_RG1 ex/SAMP6_RG1     CTRL                452429
## 17     SAMP6      L2 SAMP6_RG2 ex/SAMP6_RG2     CTRL                458810
## 18     SAMP6      L3 SAMP6_RG3 ex/SAMP6_RG3     CTRL                477751
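To make the layout of this decoder explicit, the sketch below rebuilds the relevant columns by hand from the printed output and tabulates the design. The variable names are chosen for this illustration only; with QoRTsExampleData installed, the same checks could be run directly on decoder.data.

```r
# Rebuild the sample/lane layout printed above (values transcribed from that output).
samples <- paste0("SAMP", 1:6)
lanes   <- paste0("L", 1:3)
example <- data.frame(
  sample.ID = rep(samples, each = 3),
  lane.ID   = rep(lanes, times = 6),
  group.ID  = rep(c("CASE", "CTRL"), each = 9),
  input.read.pair.count = c(465298, 472241, 500691, 461405, 467713, 492322,
                            485397, 489859, 516906, 460968, 468391, 484530,
                            469884, 475001, 494213, 452429, 458810, 477751),
  stringsAsFactors = FALSE
)
example$unique.ID <- paste0(example$sample.ID, "_RG", sub("L", "", example$lane.ID))

# Technical replicates per sample, and rows per lane within each group.
replicates.per.sample <- table(example$sample.ID)
design <- table(example$group.ID, example$lane.ID)

# Total input read pairs per biological sample, pooled over its replicates.
pairs.per.sample <- tapply(example$input.read.pair.count, example$sample.ID, sum)
```

Each sample appears once per lane, so every biological sample has three technical replicates and the two groups are balanced across lanes.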

Due to size constraints, the example dataset contained in this R package includes only the QC output data, not the raw BAM files themselves. The actual BAM files, along with a step-by-step example walkthrough that covers the entire analysis pipeline, are linked from the QoRTs GitHub page (https://github.com/hartleys/QoRTs).

The example dataset is derived from a set of rat pineal gland samples, which were multiplexed and sequenced across six sequencer lanes. For the sake of simplicity, the example dataset was limited to only six samples and three lanes. However, the BAM files alone would still occupy 18 gigabytes of disk space, which would make them unsuitable for distribution as an example dataset. To further reduce the example BAM file sizes, only reads that mapped to chromosomes chr14, chr15, chrX, and chrM were included. Additionally, all the selected chromosomes EXCEPT for chromosome 14 were randomly downsampled to 30 percent of their original read counts.

THIS DATASET IS INTENDED FOR DEMONSTRATION AND TESTING PURPOSES ONLY. Due to the various alterations that have been made to reduce file sizes and improve portability, it is not representative of the original data and is not suitable for any actual analyses.


Dr Stephen William Hartley 2016-01-28