readJunctionSeqCounts {JunctionSeq}R Documentation

Read junctionSeq count files

Description

This function loads read-count data (usually produced by QoRTs) and compiles them into a JunctionSeqCountSet object.

This function is called internally by the runJunctionSeqAnalyses function, and thus for most purposes users should not need to call this function directly. It may be useful to advanced users performing non-standard analyses.

Usage

readJunctionSeqCounts(countfiles, countdata,
    samplenames,  design,
    flat.gff.file, 
    test.formula1 = formula(~ sample + countbin + condition : countbin),
    analysis.type = c("junctionsAndExons","junctionsOnly","exonsOnly"),
    nCores = 1,
    use.exons, use.junctions, 
    use.known.junctions = TRUE, use.novel.junctions = TRUE,
    use.multigene.aggregates = FALSE,
    gene.names,
    verbose = TRUE,
    method.countVectors = c("geneLevelCounts","sumOfAllBinsForGene",
                            "sumOfAllBinsOfSameTypeForGene"),
    noDESeqMatrix = FALSE
)

Arguments

countfiles

Character vector. The filenames of the count files generated by QoRTs. The counts must all be generated using equivalent QoRTs parameters. The strandedness must be the same, as well as the inclusion of novel junctions.

countdata

List. An alternative parameterization. Instead of supplying count files using the countfiles parameter, you can pass a list of data frames, one for each sample. Each data frame should contain two columns: the first should be the feature id and the second should be the counts. This list must have the same length as the samplenames parameter.

samplenames

Character vector. A vector of full sample names, in the same order as the countfiles parameter.

design

A data frame containing the condition variable and all desired covariates. All variables should be factors.

flat.gff.file

Character string. The filename of the "flat" gff annotation file. Can be gzip-compressed. This "flat" gff file must be produced by the QoRTs jar utility using the makeFlatGtf or mergeNovelSplices functions (depending on whether inclusion of novel splice junctions is desired).

NOTE: This option is technically optional, but strongly recommended. If it is not included, then attempts to plot the results will crash unless (non-default) options are used to deactivate the plotting of genomic coordinates and transcript information

test.formula1

For advanced users. The base formula for the alternate hypothesis model used in the hypothesis tests.

NOTE: the biological condition to be tested must be named "condition".

analysis.type

Character string. One of "junctionsAndExons", "junctionsOnly", or "exonsOnly". This parameter determines what type of analysis is to be performed. By default JunctionSeq tests both splice junction loci and exonic regions for differential usage (a "hybrid" analysis). This parameter can be used to limit analyses specifically to either splice junction loci or exonic regions.

nCores

The number of cores to use. Note that multicore functionality may not be available on all platforms.

use.exons

Logical value. This is an alternate parameterization of the analysis.type parameter. If TRUE, then exonic region loci will be included in the analyses and will be tested for differential usage. If this parameter is set, then parameter use.junctions must also be set.

use.junctions

Logical value. This is an alternate parameterization of the analysis.type parameter. If TRUE, then splice junction loci will be included in the analyses and will be tested for differential usage. If this parameter is set, then parameter use.exons must also be set.

use.known.junctions

Logical value. If TRUE, then known splice junctions will not be filtered out prior to analysis. Note: this is overidden if use.junctions is FALSE or if analysis.type is set to "exonsOnly".

use.novel.junctions

Logical value. If TRUE, then novel splice junctions will not be filtered out prior to analysis. Note: this is overidden if use.junctions is FALSE or if analysis.type is set to "exonsOnly".

use.multigene.aggregates

Logical value. Whether to attempt to test "aggregate genes" which consist of multiple genes that overlap with one another. Note that inclusion of aggregate genes may affect the false discovery rate, since by their very nature aggregate genes will often show differential splice junction usage, as the two genes will often be regulated independently.

gene.names

data.frame. This optional parameter can be used to decoder the gene id's used in the actual analysis into gene symbols or gene names for general readability. This must be a text file name or data.frame with two columns of character strings. The first must be the gene ID's, and the second must be the gene names (as you wish them to appear in the plots). Genes are allowed to have multiple gene names, in which case they will be separated by commas. The gene names will be used in the plots and figures.

verbose

if TRUE, send debugging and progress messages to the console / stdout.

method.countVectors

Character string. Can be used to apply alternative methodologies or implementations. Intended for advanced users who have strong opinions about the underlying statistical methodologies.

Determines the type of count vectors to be used in the model framework. By default JunctionSeq compares the counts for a specific feature against the counts across the rest of the gene minus the counts for the specific feature. Alternatively, the sum of all other features on the gene can be used, like in DEXSeq. The advantage to the default JunctionSeq behavior is that no read or read-pair is ever counted more than once in any model. Under DEXSeq, some reads may cover many exonic segments and thus be counted repeatedly.

noDESeqMatrix

Suppresses the internal generation of a DESeq2 object. Depending on the options used this will break many downstream steps, and is for advanced usage only.

Value

A JunctionSeqCountSet.

Examples

########################################
#Set up example data:
decoder.file <- system.file(
                  "extdata/annoFiles/decoder.bySample.txt",
                  package="JctSeqData");
decoder <- read.table(decoder.file,
                  header=TRUE,
                  stringsAsFactors=FALSE);
gff.file <- system.file(
            "extdata/tiny/withNovel.forJunctionSeq.gff.gz",
            package="JctSeqData");
countFiles <- system.file(paste0("extdata/tiny/",
     decoder$sample.ID,
     "/QC.spliceJunctionAndExonCounts.withNovel.forJunctionSeq.txt.gz"),
     package="JctSeqData");
########################################
#Advanced Analysis:

#Make a "design" dataframe:
design <- data.frame(condition = factor(decoder$group.ID));
#Read the QoRTs counts.
jscs = readJunctionSeqCounts(countfiles = countFiles,
           samplenames = decoder$sample.ID,
           design = design,
           flat.gff.file = gff.file
);
## -> STARTING readJunctionSeqCounts (Thu Mar 30 17:21:24 2017)
## ---> RJSC; (v1.5.4)
## ---> RJSC: samplenames: SAMP1,SAMP2,SAMP3,SAMP4,SAMP5,SAMP6
## ---> RJSC: flat.gff.file: /home/hartleys/software-SL6/R/R3.2.3/lib/JctSeqData/extdata/tiny/withNovel.forJunctionSeq.gff.gz
## ---> RJSC: use.exons:TRUE
## ---> RJSC: use.junctions:TRUE
## ---> RJSC: use.novel.junctions:TRUE
## ---> File read complete.
## ---> Extracted counts. Found 1785 features so far.
## ---> Extracted gene-level counts. Found: 120 genes and aggregate-genes.
## ---> Removed gene features. Found: 1665 features to be included so far.
## ---> Note: 149 counting bins from overlapping genes
## --->          There are 3 multigene aggregates.
## --->          There are 8 genes that are part of an aggregate.
## ---> Removed multigene-aggregate features. Found: 1516 features to be included so far.
## ---> Final feature count: 1516 features to be included in the analysis.
## ---> Extracted feature counts.
## ---> counts complete.
## -----> reading annotation...
## -----> formatting annotation...
## -----> initial generation...
## -----> creating jscs...
## -----> generating count vectors... (Thu Mar 30 17:21:24 2017)
## > Using single-core execution.
##     getAllJunctionSeqCountVectors: dim(counts) = 1516,6 (Thu Mar 30 17:21:25 2017)
##     getAllJunctionSeqCountVectors: dim(gct) = 120,6
##     getAllJunctionSeqCountVectors: out generated. dim = 1516,12 (Thu Mar 30 17:21:25 2017)
## -----> count vectors generated (Thu Mar 30 17:21:25 2017)
## -----> generating DESeqDataSet... (Thu Mar 30 17:21:25 2017)
## -----> DESeqDataSet generated (Thu Mar 30 17:21:25 2017)

[Package JunctionSeq version 1.5.4 Index]