Next: Processing of aligned RNA-Seq Up: QoRTs Package User Manual Previous: Quick Start   Contents


Dataset Organization

Several QoRTs functions require "decoder" information, which describes each sample and all of its technical replicates (if any). The simplest method is to provide a decoder file. All columns are optional except unique.ID; however, if group, lane, or technical-replicate information is not supplied, QoRTs will not be able to produce plots that are organized or grouped by those factors.

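Fields (as they appear in the example decoders below): unique.ID, sample.ID, lane.ID, group.ID, qc.data.dir, and input.read.pair.count.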

In addition, the decoder can contain any additional columns, as long as all of the column names are distinct.
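As a rough illustration of what a decoder file might look like on disk, the sketch below writes a small decoder with base R and reads it back. It assumes a tab-delimited text file with a header row, which is consistent with the plain read.table call used on the example decoder later in this section; the sample, lane, and group names are invented for the example and are not part of any real dataset.

```r
# Hypothetical decoder contents; all identifiers below are made up.
decoder <- data.frame(unique.ID = c("S1_L1", "S1_L2", "S2_L1", "S2_L2"),
                      sample.ID = c("S1", "S1", "S2", "S2"),
                      lane.ID   = c("L1", "L2", "L1", "L2"),
                      group.ID  = c("CASE", "CASE", "CONTROL", "CONTROL"),
                      stringsAsFactors = FALSE)

# Write the decoder as tab-delimited text with a header row (assumed format).
decoder.path <- tempfile(fileext = ".txt")
write.table(decoder, decoder.path, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back the same way the example decoder is read in this manual.
decoder.check <- read.table(decoder.path, header = TRUE, stringsAsFactors = FALSE)

# The two requirements stated above: unique.ID is present and unique,
# and all column names are distinct.
stopifnot("unique.ID" %in% names(decoder.check))
stopifnot(anyDuplicated(decoder.check$unique.ID) == 0)
stopifnot(anyDuplicated(names(decoder.check)) == 0)
```

Each technical replicate gets its own row and its own unique.ID, while rows that share a sample.ID belong to the same sample.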

While QoRTs is primarily designed to allow comparisons between biological groups, lanes, sequencing runs, etc., it can also be used on simpler datasets or even individual samples. Thus, only the unique.ID variable is actually required. For testing purposes, you can produce a completed decoder (with all default values filled in) using the completeAndCheckDecoder function.

The simplest example is a character vector of unique.ID values:

library(QoRTs);
completeAndCheckDecoder(c("SAMPLE1","SAMPLE2","SAMPLE3"));
##   unique.ID sample.ID lane.ID group.ID qc.data.dir
## 1   SAMPLE1   SAMPLE1 UNKNOWN  UNKNOWN     SAMPLE1
## 2   SAMPLE2   SAMPLE2 UNKNOWN  UNKNOWN     SAMPLE2
## 3   SAMPLE3   SAMPLE3 UNKNOWN  UNKNOWN     SAMPLE3

Alternatively, any of the optional fields can be included or left out, as desired:

incompleteDecoder <- data.frame(unique.ID = c("SAMPLE1", "SAMPLE2"),
                                group.ID = c("CASE","CONTROL"));
completeAndCheckDecoder(incompleteDecoder);
##   unique.ID group.ID sample.ID lane.ID qc.data.dir
## 1   SAMPLE1     CASE   SAMPLE1 UNKNOWN     SAMPLE1
## 2   SAMPLE2  CONTROL   SAMPLE2 UNKNOWN     SAMPLE2
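The defaults visible in the two outputs above can be mimicked in base R, which may help when preparing a decoder without QoRTs loaded. The sketch below is only an approximation inferred from those outputs: fillDecoderDefaults is a hypothetical helper, not a QoRTs function, and it makes no claim to reproduce the other checks that completeAndCheckDecoder may perform.

```r
# Hypothetical helper (NOT part of QoRTs): fill in missing decoder columns
# using the defaults seen in the completeAndCheckDecoder output above.
fillDecoderDefaults <- function(decoder) {
  # Accept a bare character vector of unique.ID values, as in the first example.
  if (is.character(decoder)) {
    decoder <- data.frame(unique.ID = decoder, stringsAsFactors = FALSE)
  }
  # unique.ID is the only required column, and its values must be unique.
  stopifnot("unique.ID" %in% names(decoder))
  stopifnot(anyDuplicated(decoder$unique.ID) == 0)
  stopifnot(anyDuplicated(names(decoder)) == 0)

  ids <- as.character(decoder$unique.ID)
  defaults <- list(sample.ID   = ids,
                   lane.ID     = rep("UNKNOWN", length(ids)),
                   group.ID    = rep("UNKNOWN", length(ids)),
                   qc.data.dir = ids)
  # Append only the columns that the caller did not supply.
  for (column in names(defaults)) {
    if (!(column %in% names(decoder))) {
      decoder[[column]] <- defaults[[column]]
    }
  }
  decoder
}

completed <- fillDecoderDefaults(
  data.frame(unique.ID = c("SAMPLE1", "SAMPLE2"),
             group.ID  = c("CASE", "CONTROL"),
             stringsAsFactors = FALSE))
print(completed)
```

Supplied columns are left untouched and missing ones are appended afterwards, which is why group.ID stays in second position in this example.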


Example data

The separate R package QoRTsExampleData contains an example dataset with an example decoder:

directory <- paste0(system.file("extdata/", package="QoRTsExampleData",
                         mustWork=TRUE),"/");
decoder.file <- system.file("extdata/decoder.txt",
                            package="QoRTsExampleData",
                            mustWork=TRUE);
decoder.data <- read.table(decoder.file,
                           header=TRUE,
                           stringsAsFactors=FALSE);
print(decoder.data);
##    sample.ID lane.ID unique.ID  qc.data.dir group.ID input.read.pair.count
## 1      SAMP1      L1 SAMP1_RG1 ex/SAMP1_RG1     CASE                465298
## 2      SAMP1      L2 SAMP1_RG2 ex/SAMP1_RG2     CASE                472241
## 3      SAMP1      L3 SAMP1_RG3 ex/SAMP1_RG3     CASE                500691
## 4      SAMP2      L1 SAMP2_RG1 ex/SAMP2_RG1     CASE                461405
## 5      SAMP2      L2 SAMP2_RG2 ex/SAMP2_RG2     CASE                467713
## 6      SAMP2      L3 SAMP2_RG3 ex/SAMP2_RG3     CASE                492322
## 7      SAMP3      L1 SAMP3_RG1 ex/SAMP3_RG1     CASE                485397
## 8      SAMP3      L2 SAMP3_RG2 ex/SAMP3_RG2     CASE                489859
## 9      SAMP3      L3 SAMP3_RG3 ex/SAMP3_RG3     CASE                516906
## 10     SAMP4      L1 SAMP4_RG1 ex/SAMP4_RG1     CTRL                460968
## 11     SAMP4      L2 SAMP4_RG2 ex/SAMP4_RG2     CTRL                468391
## 12     SAMP4      L3 SAMP4_RG3 ex/SAMP4_RG3     CTRL                484530
## 13     SAMP5      L1 SAMP5_RG1 ex/SAMP5_RG1     CTRL                469884
## 14     SAMP5      L2 SAMP5_RG2 ex/SAMP5_RG2     CTRL                475001
## 15     SAMP5      L3 SAMP5_RG3 ex/SAMP5_RG3     CTRL                494213
## 16     SAMP6      L1 SAMP6_RG1 ex/SAMP6_RG1     CTRL                452429
## 17     SAMP6      L2 SAMP6_RG2 ex/SAMP6_RG2     CTRL                458810
## 18     SAMP6      L3 SAMP6_RG3 ex/SAMP6_RG3     CTRL                477751
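To see how the decoder columns relate to one another, it can help to tabulate the design and sum the read-pair counts over technical replicates. The sketch below does this in base R on the first six rows of the decoder printed above, retyped by hand so that the example runs without QoRTsExampleData installed; the variable name decoder.subset is introduced here for illustration only.

```r
# First six rows of the example decoder shown above, retyped for illustration.
decoder.subset <- data.frame(
  sample.ID = c("SAMP1", "SAMP1", "SAMP1", "SAMP2", "SAMP2", "SAMP2"),
  lane.ID   = c("L1", "L2", "L3", "L1", "L2", "L3"),
  unique.ID = c("SAMP1_RG1", "SAMP1_RG2", "SAMP1_RG3",
                "SAMP2_RG1", "SAMP2_RG2", "SAMP2_RG3"),
  group.ID  = rep("CASE", 6),
  input.read.pair.count = c(465298, 472241, 500691, 461405, 467713, 492322),
  stringsAsFactors = FALSE)

# Cross-tabulate samples against lanes: each cell counts technical replicates.
design <- table(decoder.subset$sample.ID, decoder.subset$lane.ID)
print(design)

# Total input read pairs per sample, summed over its technical replicates.
totals <- tapply(decoder.subset$input.read.pair.count,
                 decoder.subset$sample.ID, sum)
print(totals)
```

With the full decoder.data object loaded as shown earlier, the same two calls can be applied to it directly.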

Due to size constraints, the example dataset contained in this R package includes only the QC output data, not the raw BAM files themselves. The actual BAM files, along with a step-by-step example walkthrough that covers the entire analysis pipeline, are linked from the QoRTs GitHub page (https://github.com/hartleys/QoRTs).

The example dataset is derived from a set of rat pineal gland samples, which were multiplexed and sequenced across six sequencer lanes. For the sake of simplicity, the example dataset was limited to six samples and three lanes. However, the BAM files alone would still occupy 18 gigabytes of disk space, which would make them unsuitable for distribution as an example dataset. To further reduce the example BAM file sizes, only reads that mapped to chromosomes chr14, chr15, chrX, and chrM were included. Additionally, all of the selected chromosomes except chr14 were randomly downsampled to 30 percent of their original read counts.
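The filtering and downsampling described above can be pictured with a toy example. The sketch below is not the procedure actually used to build the example BAM files, which this manual does not specify; it uses invented chromosome labels in place of real reads and keeps each read independently with probability 0.3, so the retained fraction is only approximately 30 percent. The variable names and the random seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed so the toy example is repeatable

# Toy stand-in for aligned reads: one chromosome label per read (invented data).
read.chrom <- sample(c("chr1", "chr14", "chr15", "chrX", "chrM"),
                     size = 10000, replace = TRUE)
keep.chroms <- c("chr14", "chr15", "chrX", "chrM")
downsample.fraction <- 0.3

# Step 1: keep only reads that fall on the selected chromosomes.
on.selected <- read.chrom %in% keep.chroms

# Step 2: keep every chr14 read; keep any other read with probability 0.3.
keep.prob <- ifelse(read.chrom == "chr14", 1, downsample.fraction)
kept <- on.selected & (runif(length(read.chrom)) < keep.prob)

# Fraction of reads retained on each chromosome.
retained <- tapply(kept, read.chrom, mean)
print(round(retained, 2))
```

In this toy setup chr1 is dropped entirely, chr14 is kept in full, and the remaining chromosomes retain roughly 30 percent of their reads. With real paired-end data, both mates of a pair would need to be kept or dropped together.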

THIS DATASET IS INTENDED FOR DEMONSTRATION AND TESTING PURPOSES ONLY. Because of the alterations made to reduce file sizes and improve portability, it is not representative of the original data and is not suitable for any actual analyses.


Dr Stephen William Hartley 2016-06-14