ABITraceTCPeakCalculator
AggregatePlotter
AlignmentEndTrimmer
Alleler
AllelicExpressionDetector
AllelicMethylationDetector
AMD
BamIntensityJoiner
BamNMerIntensityParser
Bar2Gr
Bar2USeq
BaseClassifier
Bed2Bar
BedStats
BisSeq
BisSeqAggregatePlotter
BisSeqErrorAdder
BisStat
BisStatRegionMaker
CalculatePerCycleErrorRate
ChIPSeq
CHPCAligner
CompareIntersectingRegions
CompareParsedAlignments
ConcatinateFastas
CorrelatePointData
CountChromosomes
BisulfiteConvertFastas
CorrelationMaps
ConvertFastaA2G
ConvertFastqA2G
ConvertFasta2GCBoolean
ConvertFasta2GCBarGraph
DefinedRegionBisSeq
DefinedRegionDifferentialSeq
DefinedRegionRNAEditing
DefinedRegionScanSeqs
DRDSAnnotator
EnrichedRegionMaker
ElandMultiParser
ElandParser
ElandSequenceParser
ExportExons
ExportIntergenicRegions
ExportIntronicRegions
ExportTrimmedGenes
FetchGenomicSequences
FindNeighboringGenes
FindOverlappingGenes
FindSharedRegions
FileCrossFilter
FileMatchJoiner
FileJoiner
FileSplitter
FilterDuplicateAlignments
Graph2Bed
FilterIntersectingRegions
FilterPointData
GenerateOverlapStats
Gr2Bar
InosinePredict
IntersectLists
IntersectKeyWithRegions
IntersectRegions
KeggPathwayEnrichment
MaqSnps2Bed
MakeSpliceJunctionFasta
MakeTranscriptome
MaskExonsInFastaFiles
MaskRegionsInFastaFiles
MaxEntScanScore3
MaxEntScanScore5
MergeExonMetrics
MergePairedSamAlignments
MergePointData
MergeRegions
MergeUCSCGeneTable
MethylationArrayScanner
MethylationArrayDefinedRegionScanner
MicrosatelliteCounter
MiRNACorrelator
MultipleReplicaScanSeqs
MultiSampleVCFFilter
NovoalignBisulfiteParser
NovoalignIndelParser
NovoalignParser
NovoalignPairedParser
OligoTiler
OverdispersedRegionScanSeqs
ParseExonMetrics
ParseIntersectingAlignments
ParsePointDataContexts
PeakShiftFinder
PointDataManipulator
Primer3Wrapper
PrintSelectColumns
QCSeqs
Qseq2Fastq
RandomizeTextFile
RankedSetAnalysis
ReadCoverage
ReferenceMutator
RNAEditingPileUpParser
RNAEditingScanSeqs
RNASeq
RNASeqSimulator
Sam2Fastq
Sam2USeq
SamAlignmentExtractor
SamComparator
SamParser
SamTranscriptomeParser
SamFixer
SamReadDepthSubSampler
SamSVFilter
SamSubsampler
ScanSeqs
ShiftAnnotationPositions
SoapV1Parser
SubtractRegions
ScoreChromosomes
ScoreParsedBars
ScoreSequences
Sgr2Bar
Simulator
StrandedBisSeq
SRAProcessor
SubSamplePointData
Tag2Point
Text2USeq
TomatoFarmer
Telescriptor
UCSCBig2USeq
USeq2UCSCBig
USeq2Text
VCFAnnotator
VCFComparator
VCFReporter
VCFSpliceAnnotator
VCFTabix
Wig2Bar
Wig2USeq
ScoreMethylatedRegions
ScoreEnrichedRegions
**************************************************************************************
** ABI Trace TC Peak Calculator: July 2009 **
**************************************************************************************
Uses a sliding window to estimate the mean T peak area, compares it to the observed
T area for a given T or C to estimate the fraction T. Useful for calculating the
fraction of converted Cs from bisulfite-treated DNA in a methylation experiment.
Required Parameters:
-f Full path file name for a tab delimited text ABI trace file.
-w Window size in bp for estimating mean T peak areas, defaults to 16.
-c Print only T and C bases, defaults to all.
Example: java -jar pathTo/Apps/ABITraceTCPeakCalculator -f /MyBisSeqData/exp1.txt -c
**************************************************************************************
**************************************************************************************
** Aggregate Plotter: August 2012 **
**************************************************************************************
Fetches point data contained within each region, inverts - stranded annotation, zeros
the coordinates, sums, and window averages the values. Useful for generating
class averages from a list of annotated regions. Use a spreadsheet app to graph the
results.
Options:
-t PointData directories, full path, comma delimited. These should contain chromosome
specific xxx.bar.zip files.
-b Bed file (chr, start, stop, text, score, strand(+/-/.)), full path, containing
regions to stack. Must be all the same size.
-p Peak shift, average distance between + and - strand peaks. Will be used to shift
the PointData by 1/2 the peak shift, defaults to 0.
-u Strand usage, defaults to 0 (combine), 1 (use only same strand), 2 (opposite
strand), or 3 (ignore). Use this option to select particular stranded data to
aggregate.
-r Replace scores with 1.
-d Delog2 scores. Do it if your data is in log2 space.
-v Convert each region scores to % of total.
-n Divide scores by the number of regions.
-s Scale all regions to a particular size. Defaults to max region size.
-a Average region scores instead of summing.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/AggregatePlotter -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -b /Anno/tssSites.bed -p 73 -u 1
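The aggregation described above can be sketched in a few lines. This is a minimal
illustration, not the application's code: it assumes a single chromosome, point data
held as a position-to-score dictionary, and interbase (start, stop, strand) regions.
The names aggregate and window_average are hypothetical.

```python
from typing import Dict, List, Tuple

Region = Tuple[int, int, str]  # (start, stop, strand), interbase coordinates


def aggregate(regions: List[Region], points: Dict[int, float]) -> List[float]:
    """Sum zeroed, strand-oriented point scores across equally sized regions."""
    size = regions[0][1] - regions[0][0]
    totals = [0.0] * size
    for start, stop, strand in regions:
        if stop - start != size:
            raise ValueError("all regions must be the same size")
        # Zero the coordinates: position 'start' becomes index 0.
        scores = [points.get(pos, 0.0) for pos in range(start, stop)]
        # Invert minus-strand regions so every region is read in the same direction.
        if strand == "-":
            scores.reverse()
        for i, value in enumerate(scores):
            totals[i] += value
    return totals


def window_average(values: List[float], window: int) -> List[float]:
    """Average every full window of consecutive values (assumed smoothing step)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


regions = [(10, 14, "+"), (20, 24, "-")]
points = {10: 1.0, 21: 4.0, 23: 2.0}
summed = aggregate(regions, points)
smoothed = window_average(summed, 2)
```

The summed and smoothed lists can then be pasted into a spreadsheet for graphing.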
**************************************************************************************
**************************************************************************************
** Alignment End Trimmer: April 2014 **
**************************************************************************************
This application can be used to trim alignments according to the density of mismatches.
Each base of the alignment is compared to the reference sequence from the start of the
alignment to the end. If the bases match, the score is increased by -m. If the bases
don't match, the score is decreased by -n. The alignment position with the highest
score is used as the new alignment end point. The cigar string, alignment position,
mpos and flags are all updated to reflect trimming.
Notes:
1) Insertions, deletions and skips are currently not counted as matches or mismatches
Required:
-i Path to the original alignment, sam/bam/sam.gz OK.
-r Path to the reference sequence, gzipped OK.
-o Name of the trimmed alignment output. Output is bam and bai.
Optional:
-m Score of match. Default 1
-n Score of mismatch. Default 2
-v Verbose output. This will write out detailed information for every trimmed read.
It is suggested to use this option only on small test files.
-l Min length. If the trimmed length is less than this value, the read is switched
to unaligned. Default 10bp
-e Turn on RNA Editing mode. A>G (forward reads) and T>C (reverse reads) are considered
matches.
-s Turn on mismatch scoring mode. Reads with more than -x mismatches are dropped. If
RNA Editing mode is on, A>G (forward reads) and T>C (reverse reads) are considered
matches.
-x Max number of mismatches allowed in mismatch scoring mode. Default 0
Examples:
1) java -Xmx4G -jar /path/to/AlignmentEndTrimmer -i 1000X1.bam -o 100X1.trim.bam
-r /path/to/hg19.fasta
2) java -Xmx4G -jar /path/to/AlignmentEndTrimmer -i 1000X1.bam -o 100X1.trim.bam
-r /path/to/hg19.fasta -m 0.5 -n 3
3) java -Xmx4G -jar /path/to/AlignmentEndTrimmer -i 1000X1.test.bam
-o 100X1.test.trim.bam -r /path/to/hg19.fasta -v
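The scoring scheme described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the application's code: the read and the
reference are taken as equal-length, gap-free strings, ties keep the earliest best
position, and the function name trimmed_length is hypothetical.

```python
def trimmed_length(read: str, reference: str,
                   match: float = 1.0, mismatch: float = 2.0) -> int:
    """Return the number of leading bases to keep after end trimming.

    The running score rises by `match` (-m) for each matching base and falls
    by `mismatch` (-n) for each mismatching base. The position with the
    highest running score becomes the new alignment end.
    """
    score = 0.0
    best_score = 0.0
    best_length = 0
    for i, (read_base, ref_base) in enumerate(zip(read, reference)):
        score += match if read_base == ref_base else -mismatch
        if score > best_score:  # assumption: the first highest score wins ties
            best_score = score
            best_length = i + 1
    return best_length


# Two trailing mismatches are trimmed with the default scores.
kept = trimmed_length("ACGTACGTTT", "ACGTACGTAA")
# A single internal mismatch is tolerated with the defaults but not with -m 0.5 -n 3.
kept_default = trimmed_length("ACGTTCGT", "ACGTACGT")
kept_strict = trimmed_length("ACGTTCGT", "ACGTACGT", match=0.5, mismatch=3.0)
```

A kept length below the -l minimum would correspond to the read being switched to
unaligned.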
**************************************************************************************
**************************************************************************************
** Alleler: Sept 2010 **
**************************************************************************************
Intersects a list of alleles (SNPs and INDELs) with gene models and returns their
effects on coding sequences and splice-junctions. Assumes interbase coordinates. If
ambiguous bases (i.e. R,Y,S,W,K,M) are provided the non-reference base is assumed.
Options:
-a Full path file name for a table of alleles.
-e Print an example of an allele table.
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables
-g Full path directory name containing fasta files for reference base calling
(e.g. chr1.fasta, chr5.fasta, ...).
-n Neighborhood to include in intergenic intersection, defaults to 1000
-d Print only non-synonymous and splice affector alleles, defaults to all.
-b Print results in bed format, defaults to detailed report.
-c Collapse multiple hits to the same gene producing the same variant.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/Alleler -a /APCSeq/apcFam7Alleles.txt
-u /Anno/ucscKnownGenes.txt -g /Anno/Hg18Fastas/ -n 5000 -d -b
**************************************************************************************
**************************************************************************************
** Allelic Expression Detector: August 2014 **
**************************************************************************************
Beta!
Required Options:
-n Sample names to process, comma delimited, no spaces.
-b Directory containing coordinate sorted bam and index files named according to their
sample name.
-d SNP data file containing all sample snp calls.
-r Results directory.
-s SNP map bed file from the ReferenceMutator app.
Default Options:
-g Minimum GenCall score, defaults to >= 0.2
-q Minimum alignment base quality at snp, defaults to 20
-c Minimum alignment read coverage, defaults to 4
Example: java -Xmx4G -jar pathTo/USeq/Apps/ beta!
**************************************************************************************
**************************************************************************************
** Allelic Methylation Detector: March 2014 **
**************************************************************************************
AMD identifies regions displaying allelic methylation, e.g. ~50% average mCG
methylation yet individual read pairs show a bimodal fraction distribution of either
fully methylated or unmethylated. Beta.
Options:
-s Save directory.
-f Fasta file directory.
-t BAM file directory containing one or more xxx.bam files with their associated xxx.bai
index. The BAM files should be sorted by coordinate and have passed Picard
validation.
-a Minimum number alignments per region, defaults to 15.
-e Minimum number Cs in each alignment, defaults to 6
-m Minimum region fraction methylation, defaults to 0.4
-x Maximum region fraction methylation, defaults to 0.6
-r Full path to R, defaults to /usr/bin/R
-c Converted CG context PointData directories, full path, comma delimited. These
should contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. Use the ParsePointDataContexts on the output of the
NovoalignBisulfiteParser to select CG contexts.
-n Non-converted PointData directories, ditto.
-b Provide a bed file (chr, start, stop,...), full path, to scan a list of regions
instead of the genome. See, http://genome.ucsc.edu/FAQ/FAQformat#format1
Example: java -Xmx4G -jar pathTo/USeq/Apps/ beta!
**************************************************************************************
**************************************************************************************
** Allelic Methylation Detector: September 2012 **
**************************************************************************************
AMD identifies regions displaying allelic methylation, e.g. ~50% average mCG
methylation yet individual read pairs show a bimodal fraction distribution of either
fully methylated or unmethylated.
Options:
-s Save directory.
-f Fasta file directory.
-t BAM file directory containing one or more xxx.bam files with their associated xxx.bai
index. The BAM files should be sorted by coordinate and have passed Picard
validation.
-a Minimum number alignments per region, defaults to 15.
-e Minimum number Cs in each alignment, defaults to 6
-m Minimum region fraction methylation, defaults to 0.4
-x Maximum region fraction methylation, defaults to 0.6
-r Full path to R, defaults to /usr/bin/R
-c Converted CG context PointData directories, full path, comma delimited. These
should contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. Use the ParsePointDataContexts on the output of the
NovoalignBisulfiteParser to select CG contexts.
-n Non-converted PointData directories, ditto.
-b Provide a bed file (chr, start, stop,...), full path, to scan a list of regions
instead of the genome. See, http://genome.ucsc.edu/FAQ/FAQformat#format1
Example: java -Xmx4G -jar pathTo/USeq/Apps/ beta!
**************************************************************************************
**************************************************************************************
** Bam Intensity Joiner : July 2013 **
**************************************************************************************
Extracts base level intensity information from the output of modified Picard
IlluminaBaseCallsToSam app and inserts this into an alignment file. Be sure to
synchronize the alignment output (e.g. -oSync in novoalign) so it is in the same order
as the intensity data.
Options:
-a Full path to sam/bam alignment file with header.
-i Full path to bam intensity file from running the modified IlluminaBasecallsToSam.
-r Full path bam file for saving the merged results.
-q Minimum mapping quality score. Defaults to 20, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect. For RNA-Seq data from the SamTranscriptomeParser, set this to 0.
-s Maximum alignment score. Defaults to 240, smaller numbers are more stringent.
-m Filter for particular MD fields.
-u Sub sample data, printing only every XXX alignment.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/BamIntensityJoiner -u 10000 -m 101 -a
/Alignments/8341X.sam.gz -i /Ints/8341X.bam -r /Merged/8341.bam
**************************************************************************************
**************************************************************************************
** Bam NMer Intensity Parser : April 2012 **
**************************************************************************************
Parses a BAM file from a modified Picard IlluminaBaseCallsToSam run on raw Illumina
sequencing data to extract information regarding N mers.
Options:
-f Full path to a bam file or directory containing such. Multiple files are merged.
-r Full path file name to save results, defaults to a derivative of -f
-n Length of the N mer, defaults to 5.
-q Minimum base quality score, defaults to 20. Only N-mers where all bases pass the
threshold are scored.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/BamNMerIntensityParser -f /Data/BamFiles/
-n 7 -q 30
**************************************************************************************
**************************************************************************************
** Bar2Gr: Nov 2006 **
**************************************************************************************
Converts xxx.bar to text xxx.gr files.
-f The full path directory/file name for your xxx.bar file(s).
Example: java -Xmx1500M -jar pathTo/T2/Apps/Bar2Gr -f /affy/BarFiles/
**************************************************************************************
**************************************************************************************
** Bar 2 USeq: Mar 2011 **
**************************************************************************************
Recurses through directories and sub directories of xxx.bar(.zip/.gz OK) files
converting them to xxx.useq files (http://useq.sourceforge.net/useqArchiveFormat.html).
Required Options:
-f Full path directory containing bar files or directories of bar files.
Default Options:
-i Index size for slicing split chromosome data (e.g. # rows per file),
defaults to 10000.
-r For graphs, select a style, defaults to 0
0 Bar
1 Stairstep
2 HeatMap
3 Line
-h Color, hexadecimal (e.g. #6633FF), enclose in quotations
-d Description, enclose in quotations
-g Reset genome version, defaults to that indicated by the bar files.
-e Delete original folders, use with caution.
-m Replace bar files with new xxx.useq file in bar file directory, use with caution.
Example: java -Xmx4G -jar pathTo/USeq/Apps/Bar2USeq -f
/AnalysisResults/ -i 5000 -h '#6633FF' -g D_rerio_Jul_2010
-d 'Final processed chIP-Seq results for Bcd and Hunchback, 30M reads'
**************************************************************************************
**************************************************************************************
** Base Classifier : Oct 2012 **
**************************************************************************************
Beta.
Options:
Example: java -Xmx1500M -jar pathTo/USeq/Apps/BamIntensityParser -f /Data/BamFiles/
-n 7 -q 30
**************************************************************************************
**************************************************************************************
** Bed2Bar: June 2010 **
**************************************************************************************
Bed2Bar builds stair step graphs from bed files for display in IGB. Strands are merged
and text information removed. Will also generate a merged bed file by thresholding the
graph at the level set with -t.
-f Full path file or directory containing xxx.bed(.zip/.gz OK) files
-v Genome version (eg H_sapiens_Mar_2006), get from UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-s Sum bed scores for overlapping regions, defaults to assigning the highest score.
-t Threshold, defaults to 0.
-g Maximum gap, defaults to 0.
Example: java -Xmx4G -jar pathTo/Apps/Bed2Bar -f /affy/res/zeste.bed.gz -v
M_musculus_Jul_2007 -g 1000 -s -t 100
**************************************************************************************
**************************************************************************************
** BedStats: June 2010 **
**************************************************************************************
Calculates several statistics on bed files where the name column contains a short read
sequence. This includes a read length distribution and frequencies of the 1st and last
bps. Can also trim your read to a particular length.
Options:
-b Full path file name for your alignment bed file or directory containing such. The
name column should contain just your sequence or seq;qual.
-t Trim the 3' ends of your reads to the indicated length, defaults to not trimming.
-s Calculate base frequencies for the given 0 indexed base instead of the last base.
-r Reverse complement sequences before calculating stats and trimming.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/BedStats -b /Res/ex1.bed.gz -s 9 -t 10
**************************************************************************************
**************************************************************************************
** BisSeq: July 2013 **
**************************************************************************************
Takes two condition (treatment and control) PointData from converted and non-converted
C bisulfite sequencing data parsed using the NovoalignBisulfiteParser and scores
regions for differential methylation using either a Fisher exact or chi-square test
for changes in methylation. A Benjamini & Hochberg correction is applied to convert
the p-values to FDRs. Data is only collected on bases that meet the minimum
read coverage threshold in both datasets. The fraction differential methylation
statistic is calculated by taking the pseudomedian of all of the log2 paired base level
fraction methylations in a given window. Overlapping windows that meet both the
FDR and pseLog2Ratio thresholds are merged when generating enriched and reduced
regions. BisSeq generates several tracks for browsing and lists of differentially
methylated regions. To examine only mCG contexts, first filter your PointData using the
ParsePointDataContexts app.
Options:
-s Save directory, full path.
-c Treatment converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files from the NBP app.
One can also provide a single directory that contains multiple PointData
directories.
-C Control converted PointData directories, ditto.
-n Treatment non-converted PointData directories, ditto.
-N Control non-converted PointData directories, ditto.
-a Scramble control data.
Default Options:
-d Minimum per base read coverage, defaults to 5.
-w Window size, defaults to 250.
-m Minimum number reads in window, defaults to 5.
-f FDR threshold, defaults to 30 (-10Log10(0.01)).
-l Log2Ratio threshold, defaults to 1.585 (3x).
-r Full path to R, defaults to '/usr/bin/R'
-g Don't print graph files.
Example: java -Xmx10G -jar pathTo/USeq/Apps/BisSeq -c /Sperm/Converted -n
/Sperm/NonConverted -C /Egg/Converted -N /Egg/NonConverted -s /Res/BisSeq
-w 500 -m 10 -l 2 -f 50
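The window statistic and the threshold scales mentioned above can be sketched as
follows. This is a minimal illustration, not the application's code: the pseudomedian
is taken to be the standard Hodges-Lehmann estimator (the median of all pairwise
averages), the per-base fractions are made-up numbers, and the function names are
hypothetical. On the -10Log10 scale an FDR of 0.01 maps to 20 and an FDR of 0.001
maps to 30, and a 3x change corresponds to a log2 ratio of about 1.585.

```python
import math
from statistics import median
from typing import List


def pseudomedian(values: List[float]) -> float:
    """Hodges-Lehmann estimator: median of all pairwise averages with i <= j."""
    walsh = [(values[i] + values[j]) / 2.0
             for i in range(len(values))
             for j in range(i, len(values))]
    return median(walsh)


def neg10_log10(p: float) -> float:
    """Transform a p-value or FDR to the -10Log10 scale used by the thresholds."""
    return -10.0 * math.log10(p)


# Made-up per-base fraction methylation values for one window.
treatment = [0.80, 0.60, 0.90]
control = [0.20, 0.30, 0.30]
log2_ratios = [math.log2(t / c) for t, c in zip(treatment, control)]
window_score = pseudomedian(log2_ratios)

fdr_scaled = neg10_log10(0.001)
fold_threshold = math.log2(3)
```

Bases with a zero fraction would need special handling before taking the log2 ratio;
that step is left out of this sketch.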
**************************************************************************************
**************************************************************************************
** Bis Seq Aggregate Plotter: October 2012 **
**************************************************************************************
BSAP merges bisulfite data over equally sized regions to generate data for class
average aggregate plots of fraction methylation. A smoothing window is also applied.
Data for unstranded, sense, and antisense are produced.
Options:
-c Converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. See the NovoalignBisulfiteParser app.
-n Non-converted PointData directories, ditto.
-b Bed file (tab delim: chr start stop name score strand(+/-/.)), full path.
-i Don't invert - stranded regions, defaults to inverting.
-s Scale all regions to a particular size. Defaults to scaling to max region size.
-m Calculate individual base fractions and then take a mean, ignoring zeros, over
the window, instead of summing the obs in the window and taking the fraction.
-o Minimum number of observations before scoring base fraction methylation, defaults
to 8.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/BisSeqAggregatePlotter -c
/NBP/Con -n /NBP/NonCon -b /Anno/tssSites.bed -m
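The difference between the default window fraction and the -m option can be sketched
as follows. This is a minimal illustration, not the application's code. It assumes
fraction methylation is non-converted / (converted + non-converted), and it reads
"ignoring zeros" as skipping bases with too few observations, which is an assumption.
The function names are hypothetical.

```python
from typing import List, Optional


def pooled_fraction(con: List[int], non_con: List[int]) -> Optional[float]:
    """Default style: sum the observations in the window, then take the fraction."""
    total = sum(con) + sum(non_con)
    return sum(non_con) / total if total else None


def mean_base_fraction(con: List[int], non_con: List[int],
                       min_obs: int = 8) -> Optional[float]:
    """-m style: take per-base fractions, then average them.

    Bases with fewer than `min_obs` observations are skipped (assumed
    meaning of the -o option together with "ignoring zeros").
    """
    fractions = [n / (c + n) for c, n in zip(con, non_con) if c + n >= min_obs]
    return sum(fractions) / len(fractions) if fractions else None


# Per-base converted and non-converted observation counts in one window.
converted = [8, 0, 90]
non_converted = [2, 10, 10]
pooled = pooled_fraction(converted, non_converted)
per_base = mean_base_fraction(converted, non_converted)
```

The pooled value is dominated by the deeply covered third base, while the per-base
mean weights each scored base equally.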
**************************************************************************************
**************************************************************************************
** BisSeqErrorAdder: June 2012 **
**************************************************************************************
Takes PointData from converted and non-converted C bisulfite sequencing data parsed
using the NovoalignBisulfiteParser and simulates a worse non-conversion rate by
randomly picking converted observations and making them non-converted. This is
accomplished by first measuring the non-conversion rate in the test chromosome (e.g.
chrLambda), calculating the fraction of converted C's that need to flip to non-converted
to reach the target fraction non-converted and then using this flip fraction
to modify the other chromosome data.
Options:
-s Save directory, full path.
-c Converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-n Non-converted PointData directories, ditto.
-f Target fraction non-converted for test chromosome, this cannot be less than the
current fraction.
-t Test chromosome, defaults to chrLambda* .
Example: java -Xmx12G -jar pathTo/USeq/Apps/BisSeqErrorAdder -c /Data/Sperm/Converted
-n /Data/Sperm/NonConverted -f 0.02
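The flip fraction described above follows from simple arithmetic. The sketch below is
a minimal illustration, not the application's code. It assumes the non-conversion
rate is non-converted / (converted + non-converted) on the test chromosome; the
function names and the observation counts are hypothetical.

```python
import random
from typing import Tuple


def flip_fraction(converted: int, non_converted: int, target: float) -> float:
    """Fraction of converted observations to flip to reach `target` non-converted."""
    total = converted + non_converted
    if converted == 0:
        raise ValueError("no converted observations to flip")
    current = non_converted / total
    if target < current:
        raise ValueError("target cannot be less than the current fraction")
    # Solve (non_converted + f * converted) / total == target for f.
    return (target * total - non_converted) / converted


def add_errors(converted: int, non_converted: int, fraction: float,
               rng: random.Random) -> Tuple[int, int]:
    """Randomly flip converted observations to non-converted at one position."""
    flipped = sum(1 for _ in range(converted) if rng.random() < fraction)
    return converted - flipped, non_converted + flipped


# Made-up test chromosome counts: 1% non-converted, raised to a 2% target.
fraction = flip_fraction(converted=9900, non_converted=100, target=0.02)
# Apply the same flip fraction to a position on another chromosome.
new_con, new_non_con = add_errors(1000, 10, fraction, random.Random(1))
```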
**************************************************************************************
**************************************************************************************
** BisStat: May 2014 **
**************************************************************************************
Takes PointData from converted and non-converted C bisulfite sequencing data parsed
using the NovoalignBisulfiteParser and generates several xxCxx context statistics and
graphs (bp and window level fraction converted Cs) for visualization in IGB.
BisStat estimates whether a given C is methylated using a binomial distribution where
the expected fraction can be calculated using the fraction of non-converted Cs present in the
lambda data. Binomial p-values are converted to FDRs using the Benjamini & Hochberg
method. This app requires considerable RAM (10-64G).
Options:
-s Save directory, full path.
-c Converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-n Non-converted PointData directories, ditto.
-f Directory containing chrXXX.fasta(/.fa .zip/.gz OK) files for each chromosome.
Default Options:
-p Minimum FDR for non-converted C's to be counted as methylated, defaults to 20, a
-10Log10(FDR = 0.01) conversion.
-e Expected fraction non-converted Cs due to partial bisulfite conversion and
sequencing error, defaults to 0.005 .
-l Use the unmethylated lambda alignment data to set the expected fraction of
non-converted Cs due to partial conversion and sequencing error. This is
predicated on including a 'chrLambda' fasta sequence while aligning your data.
-o Minimum read coverage to count mC fractions, defaults to 8
-w Window size, defaults to 1000.
-m Minimum number Cs passing read coverage in window to score, defaults to 5.
-r Full path to R, defaults to '/usr/bin/R'
-g Don't merge stranded data, defaults to running a non stranded analysis. Affects CG's.
-a First density quartile fraction methylation threshold, defaults to 0.25
-b Fourth density quartile fraction methylation threshold, defaults to 0.75
Example: java -Xmx12G -jar pathTo/USeq/Apps/BisStat -c /Data/Sperm/Converted -n
/Data/Sperm/NonConverted -s /Data/Sperm/BisSeq -w 5000 -m 10 -f
/Genomes/Hg18/Fastas -o 10
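The per-base test described above can be sketched as follows. This is a minimal
illustration, not the application's code: it assumes an upper-tail binomial test of
the non-converted count against the expected non-conversion fraction (-e), followed
by a standard Benjamini & Hochberg adjustment and the -10Log10 transform used by -p.
The function names and the counts are hypothetical.

```python
from math import comb, log10
from typing import List


def binomial_upper_tail(non_con: int, total: int, expected: float) -> float:
    """P(X >= non_con) for X ~ Binomial(total, expected)."""
    return sum(comb(total, k) * expected ** k * (1.0 - expected) ** (total - k)
               for k in range(non_con, total + 1))


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini & Hochberg adjusted p-values (FDRs) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic FDRs.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up (non-converted, total) read counts for three Cs, expected fraction 0.005.
counts = [(6, 10), (1, 12), (0, 20)]
pvals = [binomial_upper_tail(n, t, 0.005) for n, t in counts]
fdrs = benjamini_hochberg(pvals)
# A C is called methylated here when its transformed FDR reaches the threshold of 20.
methylated = [-10.0 * log10(f) >= 20.0 for f in fdrs]
```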
**************************************************************************************
**************************************************************************************
** BisStat Region Maker: March 2012 **
**************************************************************************************
Takes serialized window objects from BisStat, thresholds based on the min and max
fraction methylation params and prints regions in bed format meeting the criteria.
May also build regions based on the density of a given fraction methylation quartile.
For example, to identify regions where at least 0.8 of the sequenced Cs are low
methylated (<= 0.25 default settings in BisStat) set -q 1 -m 0.8 . To find regions
with >= 0.9 of the Cs with high methylation (>= 0.75 default BisStat setting), set
-q 3 -m 0.9 .
Options:
-s SerializedWindowObject directory from BisStat, full path.
-m Minimum fraction.
-x Maximum fraction.
-g Maximum gap, defaults to 0.
-q Merge windows based on their quartile density score, not fraction methylation, by
indicating 1,2,or 3 for 1st, 2nd+3rd, or 4th, respectively.
Example: java -Xmx4G -jar pathTo/USeq/Apps/BisStatRegionMaker -m 0.8 -x 1.0 -g 100
-s /Data/BisStat/SerializedWindowObjects
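The thresholding and merging described above can be sketched as follows. This is a
minimal illustration, not the application's code: windows are assumed to be
(start, stop, fraction) tuples sorted by start on one chromosome, and passing windows
are merged when the distance between them does not exceed the maximum gap. The
function name make_regions and the window values are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int, float]  # (start, stop, fraction methylation)


def make_regions(windows: List[Window], min_fraction: float,
                 max_fraction: float, max_gap: int = 0) -> List[Tuple[int, int]]:
    """Keep windows inside [min_fraction, max_fraction] and merge nearby ones."""
    regions: List[List[int]] = []
    for start, stop, fraction in windows:
        if not (min_fraction <= fraction <= max_fraction):
            continue
        if regions and start - regions[-1][1] <= max_gap:
            # Overlapping or within the allowed gap: extend the current region.
            regions[-1][1] = max(regions[-1][1], stop)
        else:
            regions.append([start, stop])
    return [(start, stop) for start, stop in regions]


windows = [(0, 1000, 0.90), (500, 1500, 0.85), (1550, 2550, 0.95),
           (5000, 6000, 0.50), (7000, 8000, 0.99)]
merged = make_regions(windows, 0.8, 1.0, max_gap=100)
unmerged = make_regions(windows, 0.8, 1.0, max_gap=0)
```

Each resulting (start, stop) pair would correspond to one line of bed output.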
**************************************************************************************
**************************************************************************************
** Calculate Per Cycle Error Rate : Feb 2013 **
**************************************************************************************
Calculates per cycle error rates provided a sorted indexed bam file and a fasta
sequence file. Only checks CIGAR M bases, not masked or INDEL bases.
Required Options:
-b Full path to a coordinate sorted bam file (xxx.bam) with its associated (xxx.bai)
index or directory containing such. Multiple files are processed independently.
Unsorted xxx.sam(.gz/.zip OK) files also work but are processed rather slowly.
-f Full path to the single fasta file you wish to use in calculating the error rate.
-n Require read names to begin with indicated text, defaults to accepting everything.
-o Path to log file. Write coverage statistics to a log file instead of stdout.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/CalculatePerCycleErrorRate -b /Data/Bam/
-f /Fastas/chrPhiX_Illumina.fasta.gz -n HWI
**************************************************************************************
**************************************************************************************
** ChIPSeq: May 2014 **
**************************************************************************************
The ChIPSeq application is a wrapper for processing ChIP-Seq data through a variety of
USeq applications. It:
1) Parses raw alignments (sam, eland, bed, or novoalign) into binary PointData
2) Filters PointData for duplicate alignments
3) Makes relative ReadCoverage tracks from the PointData (reads per million mapped)
4) Runs the PeakShiftFinder to estimate the peak shift and optimal window size
5) Runs the MultipleReplicaScanSeqs to window scan the genome generating enrichment
tracks using DESeq2's negative binomial pvalues and B&H's FDRs
6) Runs the EnrichedRegionMaker to identify likely chIP peaks (FDR < 1%, >2x).
Options:
-s Save directory, full path.
-t Treatment alignment file directories, full path, comma delimited, no spaces, one
for each biological replica. These should each contain one or more text
alignment files (gz/zip OK) for a particular replica. Alternatively, provide
one directory that contains multiple alignment file directories.
-c Control alignment file directories, ditto.
-y Type of alignments, either novoalign, sam, bed, or eland (sorted or export).
-v Genome version (e.g. H_sapiens_Feb_2009, M_musculus_Jul_2007), see UCSC FAQ,
http://genome.ucsc.edu/FAQ/FAQreleases.
-r Full path to R, defaults to '/usr/bin/R'. Be sure to install DESeq2, gplots, and
qvalue Bioconductor packages.
Advanced Options:
-m Combine any replicas and run single replica analysis (ScanSeqs), defaults to
using DESeq2.
-a Maximum alignment score. Defaults to 60, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect. Set to 0 for RNASeq data.
-p Peak shift, defaults to the PeakShiftFinder peak shift or 150bp. Set to 0 for
RNASeq data.
-w Window size, defaults to the PeakShiftFinder peak shift + stnd dev or 250bp.
-i Minimum number reads in window, defaults to 10.
-f Filter bed file (tab delimited: chr start stop) to use in excluding intersecting
windows while making peaks, e.g. satelliteRepeats.bed .
-g Print verbose output from each application.
-e Don't look for reduced regions.
Example: java -Xmx2G -jar pathTo/USeq/Apps/ChIPSeq -y eland -v D_rerio_Dec_2008 -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/PolIINRep1/,/Data/PolIINRep2/ -s
/Data/Results/WtVsNull -f /Anno/satelliteRepeats.bed
**************************************************************************************
**************************************************************************************
** CHPC Aligner: Sept 2013 **
**************************************************************************************
Wrapper for running novoalign on the CHPC clusters. You will need to configure ssh
keys from CHPC to your data server. See http://linuxproblem.org/art_9.html (might
need to reset your home dir on alta/moab 'chmod go-w ~/'). Run
this app at the CHPC.
Required Options:
-i Genome index file on CHPC
-r Working directory on CHPC, this also defines the name of the final data archive
-f First fastq file on the data server
-s (Optional) Second paired end read fastq file on the data server
-a Archive directory on the data server for saving the final alignments
Default Options:
-l Launch jobs, defaults to not launching jobs, inspect and test the shell scripts
before committing.
-w Wall time in hours, defaults to 24.
-x Number CPUs, defaults to 16.
-e Administrator email address, defaults to david.nix@hci.utah.edu
-c (Optional) Client email addresses, comma delimited, no spaces.
-b Don't relaunch bad jobs, defaults to making 3 attempts before aborting.
-o CHPC account to draw hours from (e.g. kaplan-em), defaults to kaplan.
-d Raw data user name and server, defaults to u0028003@hci-moab.hci.utah.edu
-g Final alignment data user name and server, defaults to
u0028003@hci-moab.hci.utah.edu
-j Aligner application, defaults to
'/uufs/chpc.utah.edu/common/home/hcibcore/tomato/app/novoalign/novoalign'
-p Aligner cmd line options
-n Number of reads to process per job, defaults to 1000000
-k Number of jobs to run, defaults to number of reads per job setting
-t Filter results for lines containing a 'chr' string, defaults to all.
-q Strip @SQ: lines from SAM alignment results, recommended for transcriptomes.
Example: java -Xmx4G -jar pathTo/USeq/Apps/CHPCAligner -p '-F ILMFQ -t60 -rRandom'
-i ~/Genomes/hg19Splices34bpAdaptersNovo.index
-r /scratch/serial/u0028003/7317X1_100602
-f /mnt/hci-ma/MicroarrayData/2010/7317R/GAII/100602_7317X1_s_7_1_sequence.txt.gz
-s /mnt/hci-ma/MicroarrayData/2010/7317R/GAII/100602_7317X1_s_7_2_sequence.txt.gz
-a /mnt/hci-ma/AnalysisData/2010/A115 -w 6 -e nix@gmail.com -t -b
**************************************************************************************
**************************************************************************************
** Compare Intersecting Regions: Nov 2012 **
**************************************************************************************
Compares test region file(s) against a master set of regions for intersection.
Reports the results as columns relative to the master. Assumes interbase coordinates.
Options:
-m Full path for the master bed file (tab delim: chr start stop ...).
-t Full path to the test bed file to intersect or directory of files.
-g Maximum bp gap allowed for scoring an intersection, defaults to 0 bp. Negative gaps
force overlaps, positive gaps allow non intersecting bases between regions.
Example: java -Xmx4G -jar pathTo/Apps/CompareIntersectingRegions -g 1000
-m /All/mergedRegions.bed.gz -t /IndividualERs/
************************************************************************************
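The -g gap rule above can be sketched as follows. This is a minimal illustration, not the app's code: the function name is hypothetical, both regions are assumed to be on the same chromosome in interbase coordinates, and treating abutting regions as intersecting at the default gap of 0 is an assumption of this sketch.

```python
def intersects(a_start, a_stop, b_start, b_stop, max_gap=0):
    """Score two interbase regions (zero based, stop excluded) for intersection.

    separation > 0  : number of bases lying between the two regions
    separation == 0 : the regions abut
    separation < 0  : the regions overlap by -separation bases
    """
    separation = max(a_start, b_start) - min(a_stop, b_stop)
    # A negative max_gap demands at least that many overlapping bases,
    # a positive max_gap tolerates up to that many bases between regions.
    return separation <= max_gap


if __name__ == "__main__":
    print(intersects(0, 10, 5, 15))                    # overlapping regions
    print(intersects(0, 10, 1000, 1100, max_gap=1000)) # 990 bp apart, within gap
    print(intersects(0, 10, 8, 20, max_gap=-5))        # only 2 bp overlap
```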
**************************************************************************************
** Compare Parsed Alignments: Nov 2009 **
**************************************************************************************
Compares two parsed alignments for a common distribution of snps using R's Fisher's
exact test. Run ParseIntersectingAlignments with the same snp table first.
Options:
-a Full path file name for the first xxx.alleles file.
-b Full path file name for the second xxx.alleles file.
-d Full path directory name for writing temporary files.
-r Full path file name for R, defaults to '/usr/bin/R'
Example: java -Xmx1500M -jar pathToUSeq/Apps/CompareParsedAlignments
-a /SeqData/lymphSNPs.alleles -b /SeqData/normalSNPs.alleles -d /temp/
**************************************************************************************
**************************************************************************************
** Concatinate Fastas: Oct 2010 **
**************************************************************************************
Concatenates a directory of fasta files into a single sequence separated by a defined
number of Ns. Outputs the merged fasta, bed files for the junctions and spacers, and
a file to be used to shift UCSC gene table annotations. Use this
app to create artificial chromosomes for poorly assembled genomes.
Options:
-d Full path directory for saving the results.
-f Full path directory containing fasta files to concatenate.
-n Number of Ns to use as a spacer, defaults to 1000.
-c Name to give the concatenated sequence, defaults to chrConcat .
Example: java -Xmx4G -jar pathTo/USeq/Apps/ConcatinateFastas -n 2000 -d
/zv8/MergedNA_Scaffolds -f /zv8/BadFastas/ -c chrNA_Scaffold
**************************************************************************************
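The concatenation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the (name, start, stop) offset tuples are hypothetical, sequences are held in memory as strings, and the offsets are reported in interbase coordinates of the merged sequence. The app's actual output files are not reproduced here.

```python
def concatenate(entries, spacer_len=1000):
    """Join (name, sequence) pairs into one sequence separated by runs of N.

    Returns the merged sequence plus, for each entry, its (name, start, stop)
    placement in the merged sequence (zero based, stop excluded).
    """
    spacer = "N" * spacer_len
    parts = []
    offsets = []
    position = 0
    for i, (name, seq) in enumerate(entries):
        if i > 0:
            # Spacer goes between entries only, never at the ends.
            parts.append(spacer)
            position += spacer_len
        offsets.append((name, position, position + len(seq)))
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets


if __name__ == "__main__":
    merged, offsets = concatenate([("scaffoldA", "ACGT"), ("scaffoldB", "GG")], 3)
    print(merged)
    print(offsets)
```

The per-entry offsets are what one would need to shift annotation coordinates from the original fasta entries onto the artificial chromosome.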
**************************************************************************************
** CorrelatePointData: Aug 2011 **
**************************************************************************************
Calculates a Pearson Correlation Coefficient on the values of PointData found with the
same positions in the two datasets. Do NOT use on stair-step/ heat-map graph data.
Only use on point representation data.
Options:
-f First PointData set. This directory should contain chromosome specific xxx.bar.zip
files, stranded or unstranded.
-s Second PointData set, ditto.
-p Full path file name to use in saving paired scores, defaults to not printing.
Example: java -Xmx4G -jar pathTo/USeq/Apps/CorrelatePointData -f /BaseFracMethyl/X1
-s /BaseFracMethyl/X2
**************************************************************************************
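The calculation described above, a Pearson correlation restricted to positions present in both datasets, can be sketched as follows. This is a minimal illustration: the function name is hypothetical and each dataset is assumed to be a plain dict of position to value for one chromosome, not the app's xxx.bar.zip format.

```python
import math


def pearson_on_shared_positions(first, second):
    """Pearson correlation of values found at the same positions in both dicts."""
    shared = sorted(first.keys() & second.keys())
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared positions")
    xs = [first[p] for p in shared]
    ys = [second[p] for p in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    x1 = {10: 0.1, 20: 0.2, 30: 0.3, 40: 0.9}  # position 40 is not shared
    x2 = {10: 0.2, 20: 0.4, 30: 0.6, 50: 0.5}  # position 50 is not shared
    print(pearson_on_shared_positions(x1, x2))
```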
***************************************************************
* CountChromosomes *
* *
* This script drives the samtools view command. It will create *
* a report that lists counts to standard chroms, extra *
* chroms, phiX and adapter. This data will be used in the *
* ParseMetrics App. *
* *
* -i Input file (bam format) *
* -o Output file (.txt format) *
* -r Reference (hg19, hg18, mm10, mm9 etc.) *
* -p path to samtools *
***************************************************************
**************************************************************************************
** Bisulfite Convert Fastas: Dec 2008 **
**************************************************************************************
Converts all the c/C's to t/T's in fasta file(s) maintaining case.
Required Parameters:
-f Full path text for the xxx.fasta file or directory containing such.
Example: java -Xmx2000M -jar pathTo/Apps/BisulfiteConvertFastas -f /affy/Fastas/
**************************************************************************************
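The case-preserving conversion described above can be sketched as follows. This is a minimal illustration, assuming the fasta is read as an iterable of text lines and that header lines (starting with '>') are passed through untouched; the function name is hypothetical.

```python
# Translation table mapping c -> t and C -> T, leaving all other characters alone.
_C_TO_T = str.maketrans("cC", "tT")


def bisulfite_convert(fasta_lines):
    """Yield fasta lines with every c/C in sequence lines replaced by t/T."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield line  # header line, keep the name as is
        else:
            yield line.translate(_C_TO_T)


if __name__ == "__main__":
    for converted in bisulfite_convert([">chrC", "acgtACGT", "CCcc"]):
        print(converted)
```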
**************************************************************************************
** Correlation Maps: Nov 2007 **
**************************************************************************************
CM calculates a correlation score for each window of genes and, using permutation, an
empirical p-value. The correlation score is the mean of all pairwise Spearman rank
correlations for the gene expression profiles in each window. If a single value is
given (unlogged!) for each gene, a mean of the scores within each window is calculated.
To calculate p-values, X randomized datasets are created by shuffling the expression
profiles between genes, windows are scored and pooled. P-values for each real
score are calculated based on the area under the right side of the randomized score
distribution. In addition to a spreadsheet report summary, heat map xxx.bar files
for the p-values and mean correlation are created for visualization in IGB.
Note, this analysis is not stranded. If so desired parse lists appropriately.
Parameters:
-f The full path file text for a tab delimited gene file (text,chr,start,stop,scores)
-o GenomicRegion filter file, full path file text for a tab delimited region file to use in
removing genes from correlation analysis. (chrom, start, stop).
-g Genome version for IGB visualizations (e.g. C_elegans_May_2007).
-w Window size, default is 50000bp. Setting this too small may exclude some regions.
-n Minimum number of genes required in each window, defaults to 3. Setting this too
high will exclude some regions.
-r Number of random trials, defaults to 100
Example: java -Xmx256M -jar pathTo/T2/Apps/CorrelationMaps -f /Mango/geneFile.txt
-w 30000 -n 2 -o /Mango/operons.txt
**************************************************************************************
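The empirical p-value described above, the area under the right side of the pooled randomized score distribution, can be sketched as follows. This is a minimal illustration: the function name is hypothetical, and counting ties as "at least as extreme" with no pseudocount is an assumption of this sketch, not something the menu states.

```python
from bisect import bisect_left


def empirical_p_value(real_score, randomized_scores):
    """Fraction of pooled randomized window scores >= the real window score."""
    if not randomized_scores:
        raise ValueError("need at least one randomized score")
    pool = sorted(randomized_scores)
    # bisect_left returns the index of the first pooled score >= real_score,
    # so everything from that index onward lies on the right side.
    right_side = len(pool) - bisect_left(pool, real_score)
    return right_side / len(pool)


if __name__ == "__main__":
    pooled = [0.1, 0.5, 0.8, 0.9]
    print(empirical_p_value(0.8, pooled))
```

With only 100 random trials the smallest non-zero p-value depends on how many windows are pooled, so very small p-values should be read with that resolution in mind.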
**************************************************************************************
** Convert Fasta A 2 G: Mar 2012 **
**************************************************************************************
Converts all the a/A's to g/G's in fasta file(s) maintaining case.
Required Parameters:
-f Full path for the fasta file (.fa/.fasta/.gz/.zip OK) or directory containing such.
-s Full path directory to save the converted files.
Example: java -Xmx2G -jar pathTo/Apps/ConvertFastaA2G -f /mm9/Fastas/ -s
/mm9/AGConvertedFastas/
**************************************************************************************
**************************************************************************************
** Convert Fastq A 2 G: Mar 2012 **
**************************************************************************************
Converts all the sequence A's to G's, case insensitive.
Required Parameters:
-f Full path for the fastq file or directory containing such. xxx.gz/.zip OK.
-s Optional, full path directory to save the converted files.
Example: java -Xmx2G -jar pathTo/Apps/ConvertFastqA2G -f /IllData/Fastq/
**************************************************************************************
**************************************************************************************
** Convert Fasta 2 GC Boolean: Aug 2008 **
**************************************************************************************
Converts fasta file(s) into serialized boolean[]s where every base g or c is true all
others false. Will also work with xxx.binarySeq files.
Required Parameters:
-f Full path text for the xxx.fasta file or directory containing such.
Example: java -Xmx2000M -jar pathTo/Apps/ConvertFasta2GCBoolean -f /affy/Fastas/
**************************************************************************************
**************************************************************************************
** Convert Fasta 2 GC Bar Graphs: April 2011 **
**************************************************************************************
Converts fasta files into graph files containing a 1 over each C in a CpG context.
Required Parameters:
-f Full path name for the directory containing xxx.fasta(.gz/.zip OK).
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Example: java -Xmx4G -jar pathTo/Apps/ConvertFasta2GCBarGraph -f /affy/Fastas/
-v H_sapiens_Feb_2009
**************************************************************************************
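Placing a 1 over each C in a CpG context can be sketched as follows. This is a minimal illustration: the function name is hypothetical, matching is case insensitive, and only the plus strand C of each CpG is marked, which is an assumption of this sketch.

```python
def cpg_positions(sequence):
    """Return zero based positions of every C immediately followed by a G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


if __name__ == "__main__":
    # Each returned position would receive a graph value of 1.
    print(cpg_positions("ACGTCGcgG"))
```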
**************************************************************************************
** Defined Region Bis Seq: Dec 2013 **
**************************************************************************************
Takes two condition (treatment and control) PointData from converted and non-converted
C bisulfite sequencing data parsed using the NovoalignBisulfiteParser and scores user
defined regions for differential methylation using either a Fisher exact or chi-square
test. A Benjamini & Hochberg correction is applied to convert the pvalues to FDRs. Data is
only collected on Cs that meet the minimum read coverage threshold in both datasets.
The fraction differential methylation statistic is calculated by taking the
pseudomedian of all of the log2 paired base level fraction methylations in a given
region. To examine particular mC contexts (e.g. mCG), first filter your PointData
using the ParsePointDataContexts app.
Options:
-b A bed file of regions to score (tab delimited: chr start stop ...)
-s Save directory, full path.
-c Treatment converted PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files from the NBP app.
One can also provide a single directory that contains multiple PointData
directories.
-C Control converted PointData directories, ditto.
-n Treatment non-converted PointData directories, ditto.
-N Control non-converted PointData directories, ditto.
Default Options:
-d Minimum per base read coverage, defaults to 5.
-r Full path to R, defaults to '/usr/bin/R'
Example: java -Xmx10G -jar pathTo/USeq/Apps/DefinedRegionBisSeq -c /Sperm/Converted
-n /Sperm/NonConverted -C /Egg/Converted -N /Egg/NonConverted -s /Res/DRBS
-b /Res/CpGIslands.bed
**************************************************************************************
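The Benjamini & Hochberg step mentioned above, converting p-values to FDRs, can be sketched as follows. This is a generic illustration of the standard step-up procedure with a hypothetical function name. It is not the app's code, which hands the statistics to R.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini & Hochberg adjusted p-values (FDRs) in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indexes, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * n / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```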
**************************************************************************************
** Defined Region Differential Seq: Sept 2014 **
**************************************************************************************
DRDS takes sorted bam files, one per replica, minimum one per condition, minimum two
conditions (e.g. treatment and control or a time course/ multiple conditions) and
identifies differentially expressed genes using DESeq2 or SAMseq. DESeq2's rLog
normalized count data is used to hierarchically cluster the samples. Differential
splicing is estimated using a chi-square test of independence. When testing only a
few genes or regions, append these onto a full gene table so that DESeq2 can
appropriately estimate the library size and replica variance.
Options:
-s Save directory.
-c Conditions directory containing one directory for each condition with one xxx.bam
file per biological replica and their xxx.bai indexes. 3-4 reps recommended per
condition. The BAM files should be sorted by coordinate using Picard's SortSam.
All splice junction coordinates should be converted to genomic coordinates, see
USeq's SamTranscriptomeParser.
-r Full path to R (version 3+) loaded with DESeq2, samr, and gplots, defaults to
'/usr/bin/R' file, see http://www.bioconductor.org . Type 'library(DESeq2);
library(samr); library(gplots)' in R to see if they are installed.
-u UCSC RefFlat or RefSeq gene table file, full path. Tab delimited, see RefSeq Genes
http://genome.ucsc.edu/cgi-bin/hgTables, (uniqueName1 name2(optional) chrom
strand txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 C1orf64 chr1 + 16203317
16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 . NOTE:
this table should contain only ONE composite transcript per gene (e.g. use
Ensembl genes NOT transcripts). Use the MergeUCSCGeneTable app to collapse
transcripts. See http://useq.sourceforge.net/usageRNASeq.html for details.
-b (Or) a bed file (chr, start, stop,...), full path, See,
http://genome.ucsc.edu/FAQ/FAQformat#format1
-g Genome Version (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Advanced Options:
-m Mask overlapping gene annotations, recommended for well annotated genomes.
-x Max per base alignment depth, defaults to 50000. Genes containing such high
density coverage are ignored.
-n Max number alignments per read. Defaults to 1, unique. Assumes 'NH' tags have
been set by processing raw alignments with the SamTranscriptomeParser.
-e Minimum number alignments per gene-region per replica, defaults to 10.
-i Score introns instead of exons.
-p Perform a stranded analysis. Only collect reads from the same strand as the
annotation.
-j Reverse stranded analysis. Only collect reads from the opposite strand of the
annotation. This setting should be used for Illumina's strand-specific
dUTP protocol.
-k Second read's strand is flipped. Otherwise, assumes this was not done in the
SamTranscriptomeParser.
-t Don't delete temp files (R script, R results, Rout, etc..).
-a Run SAMseq in place of DESeq2. This is only recommended with five or more
replicates per condition.
Example: java -Xmx4G -jar pathTo/USeq/Apps/DefinedRegionDifferentialSeq -c
/Data/TimeCourse/ESCells/ -s /Data/TimeCourse/DRDS -g H_sapiens_Feb_2009
-u /Anno/mergedHg19EnsemblGenes.ucsc.gz
**************************************************************************************
**************************************************************************************
** Defined Region RNA Editing: April 2014 **
**************************************************************************************
DRRE scores regions for the pseudomedian of the base fraction edits as well as the
probability that the observations occurred by chance using a permutation test based on
the chi-square goodness of fit statistic.
Options:
-b A bed file of regions to score (tab delimited: chr start stop ...)
-e Edited PointData directory from the RNAEditingPileUpParser.
These should contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. These will be merged when scanning.
-r Reference PointData directory from the RNAEditingPileUpParser. Ditto.
-a Minimum base read coverage, defaults to 5.
-t Run a stranded analysis, defaults to non-stranded.
-i Remove base fraction edits that are non zero and represented by just one edited
base.
Example: java -Xmx4G -jar pathTo/USeq/Apps/DefinedRegionRNAEditing -b hg19UTRs.bed
-e /PointData/Edited -r /PointData/Reference
**************************************************************************************
**************************************************************************************
** Defined Region Scan Seqs: March 2011 **
**************************************************************************************
DRSS takes chromosome specific PointData xxx.bar.zip files and extracts scores under
each region to calculate several statistics including a binomial p-value, Storey
q-value FDR, an empirical FDR, a p-value for strand skew, and a chi-square test of
independence between the exon read count distributions between treatment and control
data (a test for alternative splicing). Several measures of read counts are provided
including counts for each strand, a normalized log2 ratio, and RPKMs (# reads per kb
of interrogated region per total million mapped reads). If a gene table is provided,
scores under each exon are summed to give a whole gene summary. It is also recommended
to run a gene table of introns (see the ExportIntronicRegions app) to look for
intronic retention and novel transfrags/ exons. If one provides splice junction bed
files for treatment and control RNA-Seq data, see the NovoalignParser, splice
junctions will be scored for differential expression. This is an additional
calculation unrelated to the chi-square independence test. Lastly, if control
data is not provided, simple region sums are calculated.
Options:
-s Save directory, full path.
-t Treatment PointData directories, full path, comma delimited. These should
contain unshifted stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-c Control PointData directories, ditto.
-p Peak shift, average distance between + and - strand peaks for chIP-Seq data, see
PeakShiftFinder. For RNA-Seq set to the smallest expected fragment size. Will
be used to shift the PointData 3' by 1/2 the peak shift.
-r Full path to R loaded with Storey's q-value library, defaults to '/usr/bin/R'
file, see http://genomics.princeton.edu/storeylab/qvalue/
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds)
-b (Or) a bed file (chr, start, stop,...), full path, See,
http://genome.ucsc.edu/FAQ/FAQformat#format1
Advanced Options:
-o Don't remove overlapping exons, defaults to filtering gene annotation for overlaps.
-i Score introns instead of exons.
-f Scan for just enriched regions, defaults to look for both. Only use with chIP-Seq
datasets where the control is input. This turns on the empFDR estimation.
-d Treatment splice junction bed file(s) from the NovoalignParser, comma delimited,
full path.
-e Control splice junction bed file(s), comma delimited, full path.
-m Minimum number of reads in associated gene before scoring splice junctions.
Used in estimating the expected proportion of T and scaling the log2Ratio.
Defaults to 100.
-w Use read score probabilities (assumes scores are > 0 and <= 1), defaults to
assigning 1 to each read score. Experimental.
Example: java -Xmx4G -jar pathTo/USeq/Apps/DefinedRegionScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults -p 100 -b /Data/selectRegions.bed -f
**************************************************************************************
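The RPKM measure defined above (# reads per kb of interrogated region per total million mapped reads) can be sketched as follows. This is a minimal illustration with a hypothetical function name and made-up example numbers.

```python
def rpkm(region_reads, region_length_bp, total_mapped_reads):
    """Reads per kilobase of interrogated region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    # reads / (length / 1e3) / (total / 1e6) simplifies to reads * 1e9 / (length * total)
    return region_reads * 1e9 / (region_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads over 2 kb of summed exons in a library of 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```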
**************************************************************************************
** DRDS Annotator: January 2014 **
**************************************************************************************
This application annotates DefinedRegionDifferentialSeq xlsx files using Ensembl
biomart tab-delimited annotation files. By default, ensembl biomart output files will
list the Ensembl gene id in the first column and Ensembl transcript id in the second
column. This application assumes these defaults. It will match the gene id in the
first column of the biomart file to the name listed in the 'IGB HyperLink' column
found in the 'Analyzed Genes' tab of the DRDS xlsx output. All biomart columns after
the transcript id column are added to the output file. The data is inserted between
the 'Alt Name' and locus columns in the 'Analyzed Genes' tab.
The biomart output files can have multiple annotation lines for each gene id.
Currently, this app uses the first annotation line encountered.
Required Arguments:
-i Input file. Path to DRDS xlsx output file you wish to annotate
-a Annotation file. Path to biomart annotation file.
-o Annotated output file. Path to the annotated output file
Example: java -Xmx4G -jar pathTo/USeq/Apps/DRDSAnnotator -i geneStats.xlsx
-a mm10.biomart.txt -o geneStats.ann.xlsx
**************************************************************************************
**************************************************************************************
** Enriched Region Maker: July 2013 **
**************************************************************************************
ERM combines windows from ScanSeqs xxx.swi files into larger enriched or reduced
regions based on one or more scores. For each score index, you must provide a minimal
score. Adjacent windows that exceed the minimum score(s) are merged and the best
window scores applied to the region. If treatment and control PointData are provided,
the best 25bp peak within each region will be identified and each ER rescored. To
select for ERs with a 1% FDR and 2x enrichment above control, follow the example
assuming score indexes 1,2,4 correspond to QValFDR, EmpFDR, and
Log2Ratio. Note, if you are performing a static analysis comparing chIP vs chIP,
don't set thresholds on the EmpFDR, this was disabled and all of the values are zero.
To print descriptions of the score indexes, complete the command line and skip the
-i option. Lastly, FDRs and p-values are represented in USeq in a transformed state,
as -10Log10(FDR/p-val) where 13 = 5%, 20 = 1%, etc. To select for regions with an FDR of
less than 1% you would set a threshold of 20 for the QValFDR and, if running a static
analysis, the EmpFDR.
Options:
-f Full path file name for the serialized xxx.swi file from ScanSeqs, if a
directory is specified, all xxx.swi files will be processed.
-s Minimal score(s) one for each score index, comma delimited, no spaces.
-i Score index(s) one for each minimum score.
Advanced Options:
-n Make a given number of ERs, one or more, comma delimited, no spaces. Uses score
index 0.
-m Multiply scores by -1 to make reduced regions instead of enriched regions.
-r Remove windows that intersect a list of regions. Enter a full path tab delimited
regions file text (chr start stop) Coordinates are assumed to be zero based and
stop inclusive. Useful for excluding regions from ER generation.
-b BP buffer to subtract and add to start and stops of regions used in filtering
intersecting windows, defaults to 0.
-e Exclude entire ERs that intersect the -r regions, defaults to removing windows.
This is more exclusive and will not simply punch holes in ERs but throw out
the entire ER.
-g Max gap, defaults to the size of the window used in ScanSeqs.
-t Provide treatment PointData directories, full path, comma delimited to ID the peak
center in each ER. These should contain the same unshifted stranded chromosome
specific xxx_-/+_.bar.zip files used in ScanSeqs.
-c Control PointData directories, ditto.
-p Full path to R, defaults to '/usr/bin/R', required for rescanning ERs.
-w Sub window size, defaults to 25bp.
Example: java -Xmx500M -jar pathTo/USeq/Apps/EnrichedRegionMaker -f /solexa/zeste.swi
-i 1,2,4 -s 20,20,1 -w 50
**************************************************************************************
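The -10Log10 transformation used for FDRs and p-values above can be sketched as follows, which makes it easier to pick thresholds for the -s option. The function names are hypothetical.

```python
import math


def to_useq_score(probability):
    """Transform an FDR or p-value (0 < probability <= 1) to -10*log10(probability)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be > 0 and <= 1")
    return -10.0 * math.log10(probability)


def from_useq_score(score):
    """Convert a -10*log10 transformed score back to an FDR or p-value."""
    return 10.0 ** (-score / 10.0)


if __name__ == "__main__":
    for fdr in (0.05, 0.01, 0.001):
        print(fdr, round(to_useq_score(fdr), 2))
```

Under this transform a threshold of 13 corresponds to roughly 5%, 20 to 1%, and 30 to 0.1%.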
**************************************************************************************
** Eland Multi Parser: October 2008 **
**************************************************************************************
Parses an Eland xxx.eland_multi.txt alignment file tabulating hits to each fasta entry.
Good for scoring hits to a transcriptome where every fasta entry represents a
different gene.
-f The full path directory/file text of your xxx.eland_multi.txt(.zip) file(s). Files
will be merged.
-r Full path file text for saving the results.
Example: java -Xmx1500M -jar pathToUSeq/Apps/ElandMultiParser -f
/data/MultiFiles/ -r /data/transcriptomeResults.xls
**************************************************************************************
**************************************************************************************
** ElandParser: May 2008 **
**************************************************************************************
Splits and converts Eland Extended xxx_export.txt or xxx_sorted.txt files
into center position alignment scored binary xxx.bar files. Coordinates are in
interbase coordinates (zero based, stop excluded). These can be directly viewed in IGB.
-v Versioned Genome (ie hg18, dm2, ce2, mm8), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-m Minimum alignment score, Phred scale, defaults to 13. Not used with stand alone.
-f The full path directory/file text of your xxx_export.txt(.zip/.gz) or
xxx_sorted.txt(.zip/.gz) file(s).
-r Full path directory text for saving the results, defaults to export.txt parent.
-s Shift centered position N bps 3' to accommodate chIP-seq fragment size. Stranded.
Note, this is far less than 1/2 the expected fragment size, determine best
value by visual inspection of likely positives. Defaults to 0. If you plan on
filtering your PointData, don't shift their positions, do it in the filter app.
-p Parse stand alone Eland output file.
Example: java -Xmx1500M -jar pathToUSeq/Apps/ElandParser -f /Solexa/Run7/
-v H_sapiens_Mar_2006 -s 38 -r /Solexa/ParsedData/PolIII/
**************************************************************************************
**************************************************************************************
** Eland Sequence Parser: March 2009 **
**************************************************************************************
Parses sequence information from Eland Extended alignment summary files. For every
base, sums the quality scores generating a G, A, T, and C track xxx.bar file for
visualization in IGB. Also generates a consensus track (1-fraction consensus) for
each base.
-f The full path directory/file text of your xxx_export.txt(.zip/.gz) or
xxx_sorted.txt(.zip/.gz) file(s).
-r Full path directory text for saving the results.
-g Full path directory text containing fasta files for reference base calling
(e.g. chr1.fasta, chr5.fasta).
-v Versioned Genome (ie hg18, dm2, ce2, mm8), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-a Minimum alignment score, -10Log10(p-value), defaults to 13.
-c Minimum consensus score, -10Log10(p-value), defaults to 60.
Example: java -Xmx1500M -jar pathToUSeq/Apps/ElandSequenceParser -v hg18 -c 90
-f /data/ExportFiles/ -r /data/Results -g /genomes/Hg18Fastas
**************************************************************************************
**************************************************************************************
** Export Exons Sept 2013 **
**************************************************************************************
EE takes a UCSC Gene table and prints the exons to a bed file.
Parameters:
-g Full path file text for the UCSC Gene table.
-a Expand the size of each exon by X bp, defaults to 0
-u Remove UTRs if present, defaults to including
-n Append exon numbers to the gene name field. This makes the bed file compatible
with DRDS
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportExons -g /user/Jib/ucscPombe.txt
-a 50
**************************************************************************************
**************************************************************************************
** Export Intergenic Regions May 2007 **
**************************************************************************************
EIR takes a gff file and uses it to mask a boolean array. Parts of the boolean array
that are not masked are returned and represent intergenic sequences. Be sure to put in
a gff line at the stop of each chromosome noting the last base so you capture the last
intergenic region. (eg chr1 GeneDB lastBase 3600000 3600001 . + . lastBase). Base
coordinates are assumed to be stop inclusive, not interbase.
Parameters:
-g Full path file text for a gff file or directory containing such.
-t Base pairs to trim from the ends of each intergenic region, defaults to 0.
-m Minimum acceptable intergenic size, those smaller will be tossed, defaults to 60bp
-s Subtract one from the start and stop coordinates.
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportIntergenicRegions -s -m 100 -g
/user/Jib/GffFiles/Pombe/sanger.gff
**************************************************************************************
**************************************************************************************
** Export Intronic Regions June 2007 **
**************************************************************************************
EIR takes a UCSC Gene table and fetches the most conservative/ smallest intronic
regions. Base coordinates are assumed to be stop inclusive, not interbase.
Parameters:
-g Full path file text for the UCSC Gene table.
-m Minimum acceptable intron size, those smaller will be tossed, defaults to 60bp
-s Subtract one from the stop coordinates of your UCSC table to convert from interbase.
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportIntronicRegions -s -m 100 -g
/user/Jib/ucscPombe.txt
**************************************************************************************
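Deriving introns from a gene table entry and applying the -m size filter can be sketched as follows. This is a deliberately simplified illustration: the function name is hypothetical, it handles a single transcript whose exons are given as interbase (start, stop) pairs, and it does not attempt the "most conservative/ smallest" selection across multiple transcripts that the app performs.

```python
def introns_from_exons(exons, min_size=60):
    """Return interbase (start, stop) introns between consecutive exons.

    Introns smaller than min_size are tossed.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_stop), (next_start, _) in zip(ordered, ordered[1:]):
        size = next_start - prev_stop
        if size >= min_size:
            introns.append((prev_stop, next_start))
    return introns


if __name__ == "__main__":
    exons = [(100, 200), (300, 400), (420, 500)]
    print(introns_from_exons(exons, min_size=60))
```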
**************************************************************************************
** Export Trimmed Genes May 2012 **
**************************************************************************************
ETG takes a UCSC Gene table and clips each gene back to the first intron closed by a
coding sequence exon. Thus these include all of the 5'UTRs. Genes with no introns are
removed.
Parameters:
-g Full path file text for the UCSC Gene table.
-u Print just UTRs, defaults to UTRs plus 1st CDS intron with flanking exon.
-i Print just 1st CDS intron with flanking exons.
Example: java -Xmx1000M -jar pathTo/T2/Apps/ExportTrimmedGenes -u -g
/user/Jib/ucscPombe.txt
**************************************************************************************
**************************************************************************************
** FetchGenomicSequences: Feb 2013 **
**************************************************************************************
Given a file containing genomic coordinates, fetches and saves the sequence (column
output: chrom origStart origStop fetchedStart fetchedStop completeFetch seq).
-f Full path to a file or directory containing tab delimited chrom, start,
stop text files. Interbase coordinates (zero based, stop excluded).
-s Full path directory text containing genomic fasta files. The fasta
header defines the name of the sequence, not the file name.
-b Fetch flanking bases, defaults to 0. Will set start to zero or stop to last base if
boundaries are exceeded.
-r Reverse complement fetched sequences, defaults to returning the + genomic strand.
-a Output fasta format.
Example: java -Xmx1000M -jar pathTo/T2/Apps/FetchGenomicSequences -f /data/miRNAs.txt
-s /genomes/human/v35.1/ -b 5000 -r
**************************************************************************************
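Fetching a region with flanking bases, clamping at the sequence boundaries, and optionally reverse complementing can be sketched as follows. This is a minimal illustration: the function name is hypothetical, the chromosome is assumed to be held in memory as a string, and the returned tuple loosely mirrors the fetchedStart, fetchedStop, completeFetch, and seq output columns.

```python
# Complement table covering upper and lower case bases plus N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def fetch(chrom_seq, start, stop, flank=0, reverse_complement=False):
    """Fetch an interbase region (zero based, stop excluded) plus flanking bases."""
    wanted_start = start - flank
    wanted_stop = stop + flank
    # Clamp to the sequence: start to zero, stop to the last base.
    fetched_start = max(0, wanted_start)
    fetched_stop = min(len(chrom_seq), wanted_stop)
    complete = fetched_start == wanted_start and fetched_stop == wanted_stop
    seq = chrom_seq[fetched_start:fetched_stop]
    if reverse_complement:
        seq = seq.translate(_COMPLEMENT)[::-1]
    return fetched_start, fetched_stop, complete, seq


if __name__ == "__main__":
    print(fetch("AACCGGTT", 2, 4, flank=1))
    print(fetch("AACCGGTT", 2, 4, flank=1, reverse_complement=True))
    print(fetch("AACCGGTT", 0, 3, flank=5))
```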
**************************************************************************************
** Find Neighboring Genes: Nov 2008 **
**************************************************************************************
FNG takes a list of genes in UCSC Gene Table format and intersects them with a list of
regions finding the closest gene to each region as well as all of the genes that fall
within a given neighborhood. Distance is measured from the center of the region to the
transcription start site/ 1st base position in 1st exon. See Tables link under
http://genome.ucsc.edu/ . Note, output coordinates are zero based, stop inclusive.
-g Full path file text for a tab delimited UCSC Gene Table (text chrom strand txStart
txEnd cdsStart cdsEnd exonCount exonStarts exonEnds etc...) .
-p Full path file/directory text for tab delimited region list(s) (chr, start, stop) .
-b Size of neighborhood in bp, default is 10000
-f Find genes that overlap neighborhood regardless of distance to TSS.
-c Only print closest genes.
-o Print neighbors on one line.
Example: java -jar pathTo/T2/Apps/FindNeighboringGenes -g /anno/hg17Ensembl.txt -p
/affy/p53/finalPicks.txt -b 5000 -c
**************************************************************************************
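The distance rule above (center of the region to the transcription start site) can be sketched as below. This is a simplified illustration, not the FNG implementation. The names `tss` and `neighbors` are hypothetical, genes are assumed to be given as (name, strand, txStart, txEnd) in interbase coordinates, and the neighborhood is assumed to mean "TSS within that many bp of the region center".

```python
# Illustrative sketch only; the names and the neighborhood rule are assumptions.

def tss(strand, tx_start, tx_end):
    """First base of the first exon, assuming interbase txStart/txEnd."""
    return tx_start if strand == "+" else tx_end - 1


def neighbors(region, genes, neighborhood=10000):
    """Return (closest, within) where each entry is (distance, gene_name).

    region is (start, stop); genes is a list of (name, strand, txStart, txEnd).
    """
    start, stop = region
    center = (start + stop) // 2
    scored = sorted((abs(tss(s, a, b) - center), name) for name, s, a, b in genes)
    closest = scored[0] if scored else None
    within = [(d, n) for d, n in scored if d <= neighborhood]
    return closest, within
```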
**************************************************************************************
** Find Overlapping Genes: Oct 2010 **
**************************************************************************************
Finds overlapping genes that converge, diverge, or contain one another given a UCSC
gene table.
Options:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds). NOTE:
this table should contain only one composite transcript per gene (e.g. Use
Ensembl genes NOT transcripts. See MergeUCSCGeneTable app.).
Example: java -Xmx4G -jar pathTo/USeq/Apps/FindOverlappingGenes -u
/data/zv8EnsemblGenes.ucsc.gz
**************************************************************************************
**************************************************************************************
** Find Shared Regions: Dec 2011 **
**************************************************************************************
Writes out a bed file of shared regions. Interbase coordinates.
Options:
-f First bed file (tab delimited: chr start stop ...).
-s Second bed file.
-r Results file.
-m Minimum length, defaults to 0.
Example: java -Xmx4G -jar pathTo/USeq/Apps/FindSharedRegions -f
/Res/firstBedFile.bed -s /Res/secondBedFile.bed -r /Res/common.bed -m 100
************************************************************************************
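A minimal sketch of finding shared regions between two lists of interbase intervals on a single chromosome is given below. It is not the FindSharedRegions code: the function name is hypothetical, and each input list is assumed to be internally non-overlapping (already flattened), which is what makes the simple two-pointer sweep valid.

```python
# Illustrative sketch only; assumes each list is non-overlapping within itself.

def shared_regions(first, second, min_length=0):
    """Intersect two lists of (start, stop) interbase intervals."""
    a, b = sorted(first), sorted(second)
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        # Keep only non-empty overlaps that reach the minimum length.
        if stop > start and stop - start >= min_length:
            shared.append((start, stop))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared
```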
**************************************************************************************
** File Cross Filter: March 2008 **
**************************************************************************************
FCF takes a column in the matcher file and uses it to parse the rows from other files.
Useful for pulling out and printing in order the rows that match the first file.
-m Full path file text for a tab delimited txt file to use in matching.
-f Full path file text to parse, can specify a directory too.
-i Ignore duplicate keys.
-a Column index containing the unique IDs in the matcher, defaults to 0.
-b Column index containing the unique IDs in the parsers, defaults to 0.
Example: java -jar pathTo/T2/Apps/FileCrossFilter -f /extendedArrayData/ -m /old/
originalArray.txt -a 2 -b 2
**************************************************************************************
**************************************************************************************
** File Match Joiner: July 2008 **
**************************************************************************************
FMJ loads a file and a particular column containing unique entries, a key, and then
appends the key line to lines in the parsed file that match a particular column.
Useful for appending, say, chromosome coordinates to SNP ID data, etc.
-k Full path file text for a tab delimited txt file (key) containing unique entries.
-f Ditto but for the file to parse, can specify a directory too.
-i Collapse duplicate keys.
-j Skip duplicate keys.
-a Column index containing the unique IDs in key, defaults to 0.
-b Column index containing the unique IDs in parsers, defaults to 0.
-p Print only matches.
Example: java -jar pathTo/Apps/FileMatchJoiner -k /snpChromMap.txt -f /SNPData/
-b 2 -p
**************************************************************************************
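The key-matching behavior described for FMJ can be pictured with the sketch below. It is an illustration rather than the actual implementation: `match_join` is a hypothetical name, rows are assumed to be already split on tabs, and keeping the first occurrence of a duplicated key is an assumption (the app exposes -i and -j to control this).

```python
# Illustrative sketch only; duplicate-key handling here is an assumption.

def match_join(key_rows, parse_rows, key_col=0, parse_col=0, only_matches=False):
    """Append the matching key row to each parsed row.

    key_rows and parse_rows are lists of lists of strings (tab-split lines).
    """
    keys = {}
    for row in key_rows:
        keys.setdefault(row[key_col], row)  # first occurrence of a key wins
    joined = []
    for row in parse_rows:
        hit = keys.get(row[parse_col])
        if hit is not None:
            joined.append(row + hit)
        elif not only_matches:
            joined.append(row)  # unmatched rows pass through unchanged
    return joined
```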
**************************************************************************************
** File Joiner: Feb 2005 **
**************************************************************************************
Joins text files into a single file, avoiding line concatenations. This is a problem
with using 'cat * >> combine.txt'. Removes empty lines.
Required Parameters:
-f Full path text for the directory containing the text files.
Example: java -jar pathTo/T2/Apps/FileJoiner -f /affy/SplitFiles/
**************************************************************************************
**************************************************************************************
** File Splitter: July 2010 **
**************************************************************************************
Splits a big text file into smaller files given a maximum number of lines.
Required Parameters:
-f Full path file text or directory for the text file(s) (.zip/.gz OK).
-n Maximum number of lines to place in each.
-g GZip split files.
Example: java -Xmx256M -jar pathTo/T2/FileSplitter -f /affy/bpmap.txt -n 50000
**************************************************************************************
**************************************************************************************
** FilterDuplicateAlignments: Mar 2010 **
**************************************************************************************
Filters alignments for potential amplification bias by randomly selecting X alignments
from those with the same chromosome, position, and strand. Can also filter for the
best unique alignment based on read score. Column indexes start with 0.
Options:
-f Full path file/ directory text containing tab delimited alignments.
-r Full path directory for saving the results.
-c Alignment chromosome column index.
-p Alignment position column index, assumes this is always referenced to the + strand
-s Alignment sequence column index.
-t Strand column index.
-m Save a max number of identical alignments, choose number, defaults to random
unique sequences.
-b Save only the best alignment per start position, defined by total score. Indicate
which column contains the quality ascii text.
-j Include splice junction chromosomes in filtering (e.g. chr7_101267544_101272320).
Defaults to removing them. (Only keep for RNA-Seq datasets.)
Example: java -Xmx1500M -jar pathToUSeq/Apps/FilterDuplicateAlignments -f
/Novoalign/Run7/ -r /Novoalign/Run7/DupFiltered/ -c 7 -p 8 -s 2 -b 3 -t 9
Use -c 10 -p 12 -s 8 -t 13 -b 9 for ELAND sorted or export alignments.
**************************************************************************************
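The random down-sampling of alignments that share chromosome, position, and strand might look like the sketch below. It is not the USeq code. The function name, the dictionary keys, and the `seed` parameter are hypothetical and are there only to make the example reproducible.

```python
# Illustrative sketch only; names and the dict layout are assumptions.
import random
from collections import defaultdict


def filter_duplicates(alignments, max_per_site=1, seed=None):
    """Keep at most max_per_site randomly chosen alignments per site.

    Each alignment is a dict with 'chrom', 'pos' (referenced to the + strand)
    and 'strand' entries.
    """
    rng = random.Random(seed)
    by_site = defaultdict(list)
    for aln in alignments:
        by_site[(aln["chrom"], aln["pos"], aln["strand"])].append(aln)
    kept = []
    for site in sorted(by_site):
        group = by_site[site]
        if len(group) > max_per_site:
            group = rng.sample(group, max_per_site)
        kept.extend(group)
    return kept
```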
**************************************************************************************
** Graph 2 Bed: Feb 2011 **
**************************************************************************************
Converts USeq stair step and heat map graphs into region bed files using a threshold.
Do not use this with non USeq generated graphs. Won't work with bar or point graphs.
Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files. May point this to a single directory
of such too.
-t Threshold, regions exceeding it will be saved, defaults to 0.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/Graph2Bed -t 9 -p /data/ReadCoverage
**************************************************************************************
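Thresholding a stair step graph into regions can be sketched as follows. This is a simplified illustration and not the Graph2Bed source: the graph is assumed to be available as a sorted list of (start, stop, score) blocks in interbase coordinates, and the function name is hypothetical.

```python
# Illustrative sketch only; the (start, stop, score) block layout is an assumption.

def graph_to_regions(steps, threshold=0.0):
    """Merge adjacent stair-step blocks whose score exceeds the threshold."""
    regions = []
    for start, stop, score in steps:
        if score <= threshold:
            continue  # only blocks strictly exceeding the threshold are kept
        if regions and regions[-1][1] == start:
            # Block abuts the previous passing block, so extend that region.
            regions[-1] = (regions[-1][0], stop)
        else:
            regions.append((start, stop))
    return regions
```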
**************************************************************************************
** Filter Intersecting Regions: Oct 2013 **
**************************************************************************************
Flattens the mask regions and uses it to split the split file(s) into intersecting
and non intersecting regions based on the minimum fraction intersection.
Options:
-m Full path file text for the masking bed file (tab delim: chr start stop ...).
-b Full path file text for the bed file to split into intersecting and non
intersecting regions. Can also point to a directory of files to split.
-g (Or) Full path file text for the gff/ gtf file to split into intersecting and non
intersecting regions. Can also point to a directory of files to split.
-i Minimum fraction of each split region required to score as an intersection with
the flattened mask, defaults to 1x10-1074
Example: java -Xmx4000M -jar pathTo/Apps/FilterIntersectingRegions -i 0.5
-m /ArrayDesigns/repMskedDesign.bed -b /ArrayDesigns/novoMskedDesign.bed
************************************************************************************
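The two steps named above, flattening the mask and then scoring each split region by the fraction covered, can be sketched as below. This is an illustration under assumptions, not the app's code: the function names are hypothetical, coordinates are interbase on one chromosome, and every region is assumed to have stop > start.

```python
# Illustrative sketch only; assumes interbase coordinates and non-empty regions.

def flatten(intervals):
    """Merge overlapping or abutting (start, stop) intervals."""
    merged = []
    for start, stop in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(m) for m in merged]


def split_by_mask(regions, mask, min_fraction):
    """Split regions into (intersecting, non_intersecting) by covered fraction."""
    flat = flatten(mask)
    hits, misses = [], []
    for start, stop in regions:
        covered = sum(
            max(0, min(stop, m_stop) - max(start, m_start))
            for m_start, m_stop in flat
        )
        fraction = covered / (stop - start)
        (hits if fraction >= min_fraction else misses).append((start, stop))
    return hits, misses
```

Because the comparison is `>=`, a `min_fraction` of exactly 0 would classify every region as intersecting, which is one reason to use a tiny positive default so that any overlap at all counts.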
**************************************************************************************
** Filter Point Data: Oct 2012 **
**************************************************************************************
FPD drops or saves observations from PointData that intersect a list of regions
(e.g. repeats, interrogated regions).
Options:
-p Point Data directories, full path, comma delimited. These should contain
chromosome specific xxx.bar.zip files.
-r Full path file text for a tab delimited text file containing regions to use in
filtering the intersecting data (chr start stop ..., interbase coordinates).
-i Select data that intersects the list of regions, defaults to selecting data that
doesn't intersect.
-a Acceptable intersection, fraction, defaults to 0.5
-n Just calculate the number of observations after filtering, don't save any data.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/FilterPointData -p /data/PointData
-r /repeats/hg18RepeatMasker.bed -a 0.75
**************************************************************************************
**************************************************************************************
** Generate Overlap Stats: Dec 2012 **
**************************************************************************************
Merges proper paired alignments that pass a variety of checks and thresholds. Only
unambiguous pairs will be merged. Increases base calling accuracy in overlap and helps
avoid non-independent variant observations and other double counting issues. Identical
overlapping bases are assigned the higher quality scores. Disagreements are resolved
toward the higher quality base. If too close in quality, then the quality is set to 0.
Be certain your input bam/sam file(s) are sorted by query name, NOT coordinate.
Options:
-f The full path file or directory containing raw xxx.sam(.gz/.zip OK)/.bam file(s)
paired alignments.
Multiple files will be merged.
Default Options:
-a Maximum alignment score (AS:i: tag). Defaults to 120, smaller numbers are more
stringent. Approx 30pts per mismatch for novoalignments.
-q Minimum mapping quality score, defaults to 13, larger numbers are more stringent.
Set to 0 if processing splice junction indexed RNASeq data.
-r The second paired alignment's strand is reversed. Defaults to not reversed.
-d Maximum acceptable base pair distance for merging, defaults to 5000.
-m Don't cross check read mate coordinates, needed for merging repeat matches. Defaults
to checking.
-l Output file name. Write merging statistics to file instead of standard output.
Example: java -Xmx1500M -jar pathToUSeq/Apps/MergePairedSamAlignments -f /Novo/Run7/
-c -s /Novo/STPParsedBams/run7.bam -d 10000
**************************************************************************************
**************************************************************************************
** Gr2Bar: Nov 2006 **
**************************************************************************************
Converts xxx.gr.zip files to chromosome specific bar files.
-f The full path directory/file text for your xxx.gr.zip file(s).
-v Genome version (ie H_sapiens_Mar_2006), get from UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases
-o Orientation of GR file. If not specified, orientation is left as '.'
Example: java -Xmx1500M -jar pathTo/T2/Apps/Gr2Bar -f /affy/GrFiles/ -v hg17
**************************************************************************************
**************************************************************************************
** Inosine Predict: Aug 2010 **
**************************************************************************************
IP estimates the likelihood of ADAR RNA editing using the multiplicative 4L,4R model
described in Eggington et al. 2010.
Options:
-f Multi fasta file containing sequence(s) to score.
-m Matrix scoring file.
-p Print an example matrix.
-o Don't include the opposite strand.
-s Save directory, defaults to parent of the fasta file.
-z Name of a zip archive to create containing the results.
Example: java -Xmx2G -jar pathTo/USeq/Apps/InosinePredict -m
~/ADARMatrix/hADAR1-D.matrix.txt -f ~/SeqsToScore/candidates.fasta.gz
**************************************************************************************
**************************************************************************************
** Intersect Lists: Dec 2008 **
**************************************************************************************
IL intersects two lists (of genes) and using randomization, calculates the
significance of the intersection and the fold enrichment over random. Note, duplicate
items are filtered from each list prior to analysis.
-a Full path file text for list A (or directory containing), one item per line.
-b Full path file text for list B (or directory containing), one item per line.
-t The total number of unique items from which A and B were drawn.
-n Number of permutations, defaults to 1000.
-p Print the intersection sets (common, unique to A, unique to B) to screen.
Example: java -Xmx1500M -jar pathTo/Apps/IntersectLists -a /Data/geneListA.txt -b
/Data/geneListB.txt -t 28356 -n 10000
**************************************************************************************
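One way to estimate the significance and fold enrichment of a list intersection by randomization is sketched below. It is not the IL implementation. The function name and the `seed` parameter are hypothetical, and the p-value is assumed to be the fraction of random draws whose overlap is at least as large as the observed one.

```python
# Illustrative sketch only; the exact statistic used by the app is an assumption.
import random


def intersection_enrichment(list_a, list_b, total, permutations=1000, seed=0):
    """Return (observed_overlap, p_value, fold_enrichment).

    total is the number of unique items from which both lists were drawn.
    """
    a, b = set(list_a), set(list_b)  # duplicates are dropped before analysis
    observed = len(a & b)
    rng = random.Random(seed)
    universe = range(total)
    as_extreme = 0
    overlap_sum = 0
    for _ in range(permutations):
        random_a = set(rng.sample(universe, len(a)))
        random_b = set(rng.sample(universe, len(b)))
        overlap = len(random_a & random_b)
        overlap_sum += overlap
        if overlap >= observed:
            as_extreme += 1
    mean_random = overlap_sum / permutations
    fold = observed / mean_random if mean_random > 0 else float("inf")
    return observed, as_extreme / permutations, fold
```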
**************************************************************************************
** Intersect Key With Regions: July 2012 **
**************************************************************************************
IR intersects lists of genomicRegions (chrom start stop(inclusive)) with a key, assumes the
lists are sorted from most confident to least confident. Multiple hits to the same key
region are ignored.
-k Full path file text for the key genomicRegions file, tab delimited (chr start
stop(inclusive)).
-r Full path file text or directory containing your region files to score.
-g Max gap, defaults to -1. A max gap of 0 = genomicRegions must abut, negative values force
overlap (ie -1= 1bp overlap, be careful not to exceed the length of the smaller
region), positive values enable gaps (ie 1=1bp gap).
-s Subtract 1 from end coordinates. Use for interbase.
Example: java -Xmx1500M -jar pathTo/Apps/IntersectKeyWithRegions -k /data/key.txt
-r /data/HitLists/
**************************************************************************************
**************************************************************************************
** Intersect Regions: May 2012 **
**************************************************************************************
IR intersects lists of regions (tab delimited: chrom start stop(inclusive)). Random
regions can also be used to calculate a p-value and fold enrichment.
-f First regions files, a single file, or a directory of files.
-s Second regions files, a single file, or a directory of files.
-g Max gap, defaults to 0. A max gap of 0 = regions must at least abut or overlap,
negative values force overlap (ie -1= 1bp overlap, be careful not to exceed the
length of the smaller region), positive values enable gaps (ie 1=1bp gap).
-e Score intersections where second regions are entirely contained by first regions.
-r Make random regions matched to the second regions file(s) and intersect with the
first. Enter either a bed file or full path directory that contains chromosome
specific interrogated regions files (ie named: chr1, chr2 ...: chrom start stop).
-c Match GC content of second regions file(s) when selecting random regions, rather
slow. Provide a full path directory text containing chromosome specific genomic
sequences.
-n Number of random region trials, defaults to 1000.
-w Write intersections and differences.
-x Write paired intersections.
-p Print length distribution histogram for gaps between first and closest second.
-q Parameters for histogram, comma delimited list, no spaces:
minimum length, maximum length, number of bins. Defaults to -100, 2400, 100.
Example: java -Xmx1500M -jar pathTo/Apps/IntersectRegions -f /data/miRNAs.txt
-s /data/DroshaLists/ -g 500 -n 10000 -r /data/InterrogatedRegions/
**************************************************************************************
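The max gap rule for stop-inclusive coordinates can be written down compactly. The sketch below is only an illustration of the arithmetic described for -g, with hypothetical function names: a gap of 0 means the regions abut, -1 means a 1bp overlap, and 1 means a 1bp gap.

```python
# Illustrative sketch only; regions are (start, stop) with stop inclusive.

def gap(a, b):
    """Bases separating two regions; negative values are bases of overlap."""
    return max(a[0], b[0]) - min(a[1], b[1]) - 1


def intersects(a, b, max_gap=0):
    """True when the regions are no further apart than max_gap."""
    return gap(a, b) <= max_gap
```

For example, regions (1, 10) and (12, 20) have a gap of 1, so they intersect with `max_gap=1` but not with the default of 0.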
**************************************************************************************
** Kegg Pathway Enrichment: Aug 2009 **
**************************************************************************************
KPE looks for overrepresentation of genes from a user's list in Kegg pathways using a
random permutation test. Several files are needed from http://www.genome.jp/kegg
Gene names must be in Ensembl Gene notation and begin with ENSG.
Options:
-e Full path file text for a KeggGeneIDs : EnsemblGeneIDs file (e.g. Human
ftp://ftp.genome.jp/pub/kegg/genes/organisms/hsa/hsa_ensembl-hsa.list)
-p Full path file text for a KeggPathwayIDs : TextDescription file (e.g. Human
ftp://ftp.genome.jp/pub/kegg/pathway/map_title.tab)
-g Full path file text for a KeggGeneIDs : KeggPathwayIDs file (e.g. Human
ftp://ftp.genome.jp/pub/kegg/pathway/organisms/hsa/hsa_gene_map.tab)
-a Full path file text for your list of all interrogated Ensembl genes (e.g.
ENSG00...), one gene per line.
-s Full path file text for your select gene list.
-n Number of random iterations, defaults to 10000
Example: java -Xmx1500M -jar pathTo/USeq/Apps/KeggPathwayEnrichment -e
/Kegg/hsa_ensembl-hsa.list -p /Kegg/map_title.tab -g /Kegg/hsa_gene_map.tab
-a /HCV/ensemblGenesWith20OrMoreReads.txt -s /HCV/upRegInHCV_Norm.txt
**************************************************************************************
**************************************************************************************
** MaqSnps2Bed: June 2009 **
**************************************************************************************
Converts a Maq snp text file (1 based coordinates) into a bed file (interbase
coordinates). Also writes out an Alleler formatted text file.
-f Full path file text to the file or directory containing Maq snp txt files.
Example: java -Xmx1000M -jar path2/USeq/Apps/MaqSnps2Bed -f /data/maqSnpFile.txt
**************************************************************************************
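The coordinate shift from a 1 based position to an interbase bed interval is small but easy to get wrong. A minimal sketch, with a hypothetical function name and covering only the coordinate conversion rather than the Maq file parsing:

```python
# Illustrative sketch only; covers the coordinate conversion, not Maq parsing.

def snp_to_bed(chrom, pos_1based):
    """Convert a 1 based SNP position to an interbase (start, stop) interval."""
    # Interbase start is zero based and the stop is excluded, so a single
    # base at 1 based position p spans [p - 1, p).
    return chrom, pos_1based - 1, pos_1based
```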
**************************************************************************************
** Make Splice Junction Fasta: Nov 2010 **
**************************************************************************************
DEPRECATED, don't use! See MakeTranscriptome app!
MSJF creates a multi fasta file containing sequences representing all possible linear
splice junctions. The header on each fasta is the chr_endPosExonA_startPosExonB. The
length of sequence collected from each junction is 2x the radius. A word of warning,
be very careful about the coordinate system used in the gene table to define the
start and stop of exons. UCSC uses interbase and this is assumed in this app. Check
a few of the junctions to be sure correct splices were made. All junction sequences
are from the top/ plus strand of the genome, they are not reverse complemented. Exon
sequence shorter than the radius will be appended with Ns.
Options:
-f Fasta file directory, should contain chromosome specific xxx.fasta files.
-u UCSC gene table file, full path. See, http://genome.ucsc.edu/cgi-bin/hgTables
-s Sequence length radius.
-r Results fasta file, full path.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/MakeSpliceJunctionFasta -s 32
-f /Genomes/Hg18/Fastas/ -u /Anno/Hg18/ucscKnownGenes.txt -r
/Genomes/Hg18/Fastas/hg18_32_splices.fasta
************************************************************************************
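Building one junction record from two exons can be sketched as below. This is not the MSJF source. The function name is hypothetical, exons are assumed to be interbase (start, end) pairs on the plus strand, and placing the N padding on the outer ends of short exons is an assumption about what "appended with Ns" means.

```python
# Illustrative sketch only; the side on which Ns are added is an assumption.

def junction_record(chrom, chrom_seq, exon_a, exon_b, radius):
    """Return (header, sequence) for the junction joining exon_a to exon_b.

    The header follows chr_endPosExonA_startPosExonB and the sequence is
    always 2 x radius long.
    """
    a_start, a_end = exon_a
    b_start, b_end = exon_b
    # Take up to `radius` bases from the 3' end of exon A and the 5' end of exon B.
    left = chrom_seq[max(a_start, a_end - radius):a_end]
    right = chrom_seq[b_start:min(b_end, b_start + radius)]
    # Pad exons shorter than the radius with Ns.
    left = "N" * (radius - len(left)) + left
    right = right + "N" * (radius - len(right))
    return f"{chrom}_{a_end}_{b_start}", left + right
```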
**************************************************************************************
** Make Transcriptome: June 2012 **
**************************************************************************************
Takes a UCSC ref flat table of transcripts and generates two multi fasta files of
transcripts and splices (known and theoretical). All possible unique splice junctions
are created given the exons from each gene's transcripts. In some cases this is
computationally intractable and theoretical splices from these are not complete.
Read through occurs with small exons to the next up or downstream so keep the sequence
length radius to a minimum to reduce the number of junctions. Overlapping exons are
assumed to be mutually exclusive. All sequence is from the plus genomic strand, no
reverse complementation. Interbase coordinates. This app can take a very long time to
run. Break up gene table by chromosome and run on a cluster.
To incorporate additional splice-junctions, add a new annotation line containing two
exons representing the junction to the table. If needed, set the -s option to skip
duplicates.
Options:
-f Fasta file directory, one per chromosome (e.g. chrX.fasta or chrX.fa, .gz/.zip OK)
-u UCSC RefFlat gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName transcriptName chrom strand
txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 ENST00000329454 chr1 +
16203317 16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 .
-r Sequence length radius. Set to the read length - 4bp.
-n Max number splices per transcript, defaults to 100000.
-m Max minutes to process each gene's splices before interrupting, defaults to 10.
-s Skip subsequent occurrences of splices with the same coordinates. Memory intensive.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MakeTranscriptome -f /Genomes/Hg18/Fastas/
-u /Anno/Hg18/ensemblGenes.txt.ucsc -r 46 -s
************************************************************************************
**************************************************************************************
** Mask Exons In Fasta Files: June 2011 **
**************************************************************************************
Replaces the exonic sequence with Ns.
Options:
-f Fasta file directory, one per chromosome (e.g. chrX.fasta or chrX.fa, .gz/.zip OK)
-u UCSC RefFlat gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName transcriptName chrom strand
txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 ENST00000329454 chr1 +
16203317 16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 .
-s Save directory, full path.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MaskExonsInFastaFiles -f
/Genomes/Hg18/Fastas/ -u /Anno/Hg18/ensemblTranscripts.txt.ucsc -s
/Genomes/Hg18/MaskedFastas/
************************************************************************************
**************************************************************************************
** Mask Regions In Fasta Files: Dec 2011 **
**************************************************************************************
Replaces the region (or non region) sequence with Ns. Interbase coordinates.
Options:
-f Fasta file directory, one per chromosome (e.g. chrX.fasta or chrX.fa, .gz/.zip OK)
-b Bed file of regions to mask.
-s Save directory, full path.
-r Mask sequence not in regions, reverse mask.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MaskRegionsInFastaFiles -f
/Genomes/Hg18/Fastas/ -b /Anno/Hg18/badRegions.bed -s
/Genomes/Hg18/MaskedFastas/
************************************************************************************
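Masking and reverse masking differ only in which side of the region boundary is replaced. A minimal sketch, assuming interbase (start, stop) regions on a single sequence and using a hypothetical function name:

```python
# Illustrative sketch only; regions are interbase (start, stop) pairs.

def mask_regions(seq, regions, reverse=False):
    """Replace bases inside the regions with N, or outside them if reverse."""
    inside = [False] * len(seq)
    for start, stop in regions:
        for i in range(max(0, start), min(len(seq), stop)):
            inside[i] = True
    # A base is masked when it is inside a region (normal mode) or
    # outside every region (reverse mode).
    return "".join(
        "N" if flag != reverse else base for base, flag in zip(seq, inside)
    )
```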
**************************************************************************************
** MaxEntScanScore3: Nov 2013 **
**************************************************************************************
Implementation of Max Ent Scan's score3 algorithm for human splice site detection. See
Yeo and Burge 2004, http://www.ncbi.nlm.nih.gov/pubmed/15285897
Options:
-s Full path directory name containing the me2x3acc1-9 splice model files. See
USeq/Documentation/ or http://genes.mit.edu/burgelab/maxent/download/
-t Full path file name for 23mer test sequences, GATCgatc only, one per line. Fasta OK.
Example: java -Xmx10G -jar pathTo/USeq/Apps/MaxEntScanScore3 -s ~/MES/splicemodels -t
~/MES/seqsToTest.fasta
**************************************************************************************
**************************************************************************************
** MaxEntScanScore5: Nov 2013 **
**************************************************************************************
Implementation of Max Ent Scan's score5 algorithm for human splice site detection. See
Yeo and Burge 2004, http://www.ncbi.nlm.nih.gov/pubmed/15285897
Options:
-s Full path directory containing the splice5sequences and me2x5 splice model files.
See USeq/Documentation/ or http://genes.mit.edu/burgelab/maxent/download/
-t Full path file name for 9mer test sequences, GATCgatc only, one per line. Fasta OK.
Example: java -Xmx10G -jar pathTo/USeq/Apps/MaxEntScanScore5 -s ~/MES/splicemodels -t
~/MES/seqsToTest.fasta
**************************************************************************************
**************************************************************************************
** MergeExonMetrics : June 2013 **
**************************************************************************************
This app simply merges the output from several metrics html files.
Required:
-f Directory containing metrics dictionary files and an image directory
-o Name of the combined metrics file
Example: java -Xmx1500M -jar pathTo/USeq/Apps/MergeExonMetrics -f metrics -o 9908_metrics
**************************************************************************************
**************************************************************************************
** MergePairedSamAlignments: Dec 2012 **
**************************************************************************************
Merges proper paired alignments that pass a variety of checks and thresholds. Only
unambiguous pairs will be merged. Increases base calling accuracy in overlap and helps
avoid non-independent variant observations and other double counting issues. Identical
overlapping bases are assigned the higher quality scores. Disagreements are resolved
toward the higher quality base. If too close in quality, then the quality is set to 0.
Be certain your input bam/sam file(s) are sorted by query name, NOT coordinate.
Options:
-f The full path file or directory containing raw xxx.sam(.gz/.zip OK)/.bam file(s)
paired alignments that are sorted by query name (standard novoalign output).
Multiple files will be merged.
Default Options:
-s Save file, defaults to that inferred by -f. If an xxx.sam extension is provided,
the alignments won't be sorted by coordinate and saved as a bam file.
-a Maximum alignment score (AS:i: tag). Defaults to 120, smaller numbers are more
stringent. Approx 30pts per mismatch for novoalignments.
-q Minimum mapping quality score, defaults to 13, larger numbers are more stringent.
Set to 0 if processing splice junction indexed RNASeq data.
-r The second paired alignment's strand is reversed. Defaults to not reversed.
-d Maximum acceptable base pair distance for merging, defaults to 5000.
-m Don't cross check read mate coordinates, needed for merging repeat matches. Defaults
to checking.
-o Merge all proper paired alignments. Defaults to only merging those that overlap.
-k Skip merging paired alignments. Defaults to merging. Useful for testing effect of
merging on downstream analysis.
Example: java -Xmx1500M -jar pathToUSeq/Apps/MergePairedSamAlignments -f /Novo/Run7/
-c -s /Novo/STPParsedBams/run7.bam -d 10000
**************************************************************************************
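The per-base rule for overlapping mates (agreeing bases take the higher quality, disagreements go to the higher quality base, near ties get quality 0) can be sketched as follows. This is an illustration, not the app's code. The function name is hypothetical, and the `min_diff` threshold that defines "too close in quality" is an assumed value, since the menu does not state one.

```python
# Illustrative sketch only; min_diff is a hypothetical threshold.

def merge_base(base1, qual1, base2, qual2, min_diff=5):
    """Resolve one overlapping position from two mates into (base, quality)."""
    if base1 == base2:
        # Identical calls keep the higher of the two quality scores.
        return base1, max(qual1, qual2)
    # Disagreement: take the higher quality call (first mate on an exact tie).
    base, qual = (base1, qual1) if qual1 >= qual2 else (base2, qual2)
    if abs(qual1 - qual2) < min_diff:
        qual = 0  # too close to trust either call
    return base, qual
```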
**************************************************************************************
** Merge Point Data: Jan 2011 **
**************************************************************************************
Efficiently merges PointData, collapsing by position and possibly strand. Identical
position scores are either summed or converted into counts. DO NOT use this app on
PointData that will be part of a primary chIP/RNA-seq analysis. It is only for
bis-seq and visualization purposes.
Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files. Alternatively, provide one directory
containing multiple PointData directories.
-s Save directory, full path.
-c Don't replace scores with hit count, just sum existing scores.
-m Merge strands
Example: java -Xmx1500M -jar pathTo/USeq/Apps/MergePointData -p
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedEts1 -m
**************************************************************************************
**************************************************************************************
** Merge Regions: May 2009 **
**************************************************************************************
Flattens tab delimited bed files (chr start stop ...). Assumes interbase coordinates.
Options:
-d Directory containing bed files.
Example: java -Xmx4000M -jar pathTo/Apps/MergeRegions -d /Anno/TilingDesign/
************************************************************************************
**************************************************************************************
** Merge UCSC Gene Table: Feb 2013 **
**************************************************************************************
Merges transcript models that share the same gene name (in column 0). Maximizes exons,
minimizes introns. Assumes interbase coordinates.
Options:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (geneName name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds).
Example: java -Xmx4G -jar pathTo/USeq/Apps/MergeUCSCGeneTable -u
/data/zv8EnsemblGenes.ucsc.gz
**************************************************************************************
**************************************************************************************
** Methylation Array Scanner: May 2014 **
**************************************************************************************
MAS takes paired or non-paired sample PointData representing beta values (0-1) from
arrays and scores regions with enriched/ reduced signal using a sliding window
approach. A B&H corrected Wilcoxon signed rank (or rank sum test for non-paired),
pseudo median of the log2(treat/control) ratios (or log2(pseT/pseC) for non-paired),
and permutation test FDR is calculated for each window. Use the EnrichedRegionMaker
to identify enriched and reduced regions by picking thresholds (e.g. -i 0,1 -s 0.2,13).
MAS generates several data tracks for visualization in IGB including paired sample bp
log2 ratios, window level Wilcoxon FDRs, and window level pseudomedian log2 ratios.
Note, non-paired analyses are very underpowered and require > 30 obs/ window to see
any significant FDRs.
Required Options:
-s Path to a directory for saving the results.
-d Path to a directory containing individual sample PointData directories, each of
which should contain chromosome split bar files (e.g. chr1.bar, chr2.bar, ...)
-t Names of the treatment sample directories in -d, comma delimited, no spaces.
-c Ditto but for the control samples, the ordering is critical and describes how to
pair the samples for a paired analysis.
Advanced Options:
-n Run a non-paired analysis where t and c are treated as groups and pooled.
-w Window size, defaults to 1000.
-o Minimum number observations in window, defaults to 10.
-p Minimum pseudomedian log2 ratio for estimating the permutation FDR, defaults to 0.2
-r Number permutations, defaults to 5
Example: java -Xmx4G -jar pathTo/USeq/Apps/MethylationArrayScanner -s ~/MAS/Res
-d ~/MAS/Bar/ -t Early1,Early2,Early3 -c Late1,Late2,Late3
-w 1500
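
For readers unfamiliar with the pseudomedian, here is a minimal sketch of the window score for paired samples, assuming the usual Hodges-Lehmann definition (median of all pairwise averages). The function names are hypothetical, and skipping pairs that contain a zero is an assumption made here to avoid taking the log of zero.

```python
import math
from statistics import median


def pseudomedian(values):
    """Hodges-Lehmann estimator: median of all Walsh averages (i <= j)."""
    walsh = [
        (values[i] + values[j]) / 2.0
        for i in range(len(values))
        for j in range(i, len(values))
    ]
    return median(walsh)


def paired_log2_ratios(treat, control):
    """log2(treat/control) for each pair of beta values; zero pairs are skipped."""
    return [math.log2(t / c) for t, c in zip(treat, control) if t > 0 and c > 0]


ratios = paired_log2_ratios([0.8, 0.4, 0.0], [0.4, 0.4, 0.5])
print(pseudomedian(ratios))
```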
**************************************************************************************
**************************************************************************************
** Methylation Array Defined Region Scanner: July 2013 **
**************************************************************************************
MADRS takes paired sample PointData representing beta values (0-1) from arrays and
a list of regions to score for differential methylation using a B&H corrected Wilcoxon
signed rank test and pseudo median of the paired log2(treat/control) ratios. Pairs
containing a zero value are ignored. It generates a spreadsheet of statistics for each
region. If a non-paired analysis is selected, a Wilcoxon rank sum test and
log2(pseT/pseC) are calculated on each region. Note this is a very underpowered test
requiring >30 observations to see any significant FDRs.
Required Options:
-b A bed file of regions to score (tab delimited: chr start stop ...)
-d Path to a directory containing individual sample PointData directories, each of
which should contain chromosome split bar files (e.g. chr1.bar, chr2.bar, ...)
-t Names of the treatment sample directories in -d, comma delimited, no spaces.
-c Ditto but for the control samples, the ordering is critical and describes how to
pair the samples for a paired analysis.
-o Minimum number paired observations in window, defaults to 3.
-z Skip printing regions with less than minimum observations.
-n Run a non-paired analysis where t and c are treated as groups and pooled. Uneven
numbers of t and c are allowed.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MethylationArrayDefinedRegionScanner
-v H_sapiens_Feb_2009 -d ~/MASS/Bar/ -t Early1,Early2,Early3
-c Late1,Late2,Late3
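
The B&H correction mentioned above is the standard Benjamini-Hochberg step-up procedure. A minimal sketch, not taken from the app's source and with a hypothetical function name, might look like this:

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```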
**************************************************************************************
**************************************************************************************
** Microsatellite Counter: Jan 2014 **
**************************************************************************************
MicrosatelliteCounter identifies and counts microsatellite repeats in MiSeq fastq
files. This iteration of the software requires you to specify the primers used in the
sequencing project. It will automatically find the most likely microsatellite by
looking at all possible repeats of length 1 through length 10 and finding the longest
repeat by length, not repeat unit. There are two output files generated, the first
lists primer statistics (currently only reads with both primers are used), the
second lists repeat data. Note that the input file must contain fastq sequences that
were merged using a program like PEAR.
Required Arguments:
-f Merged fastq file. Path to merged fastq file. We currently suggest using PEAR to
merge fastq sequences.
-p Primer file. Path to primer reference file. This file lists each primer used in
the sequencing project in the format NAME
**************************************************************************************
** MiRNA Correlator: March 2014 **
**************************************************************************************
Generates a spreadsheet to use in comparing changing miRNA levels to changes in gene
expression.
Options:
-r Results file.
-a All miRNA name file (single column of miRNA names).
-m MiRNA data (two columns: miRNA name, miRNA log2Rto).
-t Gene target to miRNA data (two columns: gene target name, miRNA name).
-e Gene expression data (three columns: gene name, log2Rto, FDR).
-f Don't print the gene expression FDR value in the spreadsheet.
Example: java -Xmx4G -jar pathTo/USeq/Apps/MiRNACorrelator -m miRNA_CLvsMOR.txt -a
allMiRNANamesNoPs.txt -t targetGene2MiRNA.txt -e geneExp_CLvsMOR.txt -r results.xls
**************************************************************************************
**************************************************************************************
** Multiple Replica Scan Seqs: May 2014 **
**************************************************************************************
MRSS uses a sliding window and Anders' DESeq negative binomial pvalue -> Benjamini &
Hochberg AdjP statistics to identify enriched and reduced regions in a genome. Both
treatment and control PointData sets are required, one or more biological replicas.
MRSS generates window level differential count tracks for the AdjP and normalized
log2Ratio as well as a binary window object xxx.swi file for downstream use by the
EnrichedRegionMaker. MRSS also makes use of DESeq's variance corrected count data to
cluster your biological replicas. Given R's poor memory management, running DESeq
requires lots of RAM, 64bit R, and 1-3 hrs.
Options:
-s Save directory, full path.
-t Treatment replica PointData directories, full path, comma delimited, no spaces,
one per biological replica. Use the PointDataManipulator app to merge same
replica and technical replica datasets. Each directory should contain stranded
chromosome specific xxx_-/+_.bar.zip files. Alternatively, provide one
directory that contains multiple biological replica PointData directories.
-c Control replica PointData directories, ditto.
-r Full path to 64bit R loaded with DESeq library, defaults to '/usr/bin/R' file, see
http://www-huber.embl.de/users/anders/DESeq/ . Type 'library(DESeq)' in
an R terminal to see if it is installed.
-p Peak shift, average distance between + and - strand peaks for chIP-Seq data, see
PeakShiftFinder or set it to 100bp. For RNA-Seq set it to 0. It will be used
to shift the PointData by 1/2 the peak shift.
-w Window size, defaults to the peak shift. For chIP-Seq data, a good alternative
is the peak shift plus the standard deviation, see the PeakShiftFinder app.
For RNA-Seq data, set this to 100-250.
Advanced Options:
-m Minimum number of reads in a window, defaults to 15
-d Don't delete temp files
Example: java -Xmx4G -jar pathTo/USeq/Apps/MultipleReplicaScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults/ -p 150 -w 250 -b
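
The -p option shifts PointData by 1/2 the peak shift. A rough sketch of that idea follows, assuming plus strand positions move 3' (up in coordinate) and minus strand positions move down. The function name, the integer division, and the clamp at zero are all assumptions.

```python
def shift_positions(positions, strand, peak_shift):
    """Shift read positions by half the peak shift toward the fragment center."""
    half = peak_shift // 2  # assumption: integer half shift
    if strand == "+":
        return [p + half for p in positions]
    # Minus strand: move toward lower coordinates, never below zero.
    return [max(0, p - half) for p in positions]


print(shift_positions([1000, 1200], "+", 150))
print(shift_positions([1000, 1200], "-", 150))
```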
**************************************************************************************
**************************************************************************************
** Multi Sample VCF Filter : May 2013 **
**************************************************************************************
Filters a vcf file containing multiple sample records into those that pass or fail the
tests below. This works with VCFv4.1 files created by the GATK package. Note, the
records are not modified. If the number of records in the VCF file is greater than
500,000, the VCF file is intersected in chunks. The chunks are merged and compressed
automatically at the end of the application.
Required:
-v Full path to a sorted multi sample vcf file (xxx.vcf/xxx.vcf.gz).
-p Full path to the output VCF (xxx.vcf/xxx.vcf.gz). Specifying xxx.vcf.gz will
compress and index the VCF using tabix (set -t too).
Optional:
-f Print out failing records, defaults to printing those passing the filters.
-a Fail records where no sample passes the sample thresholds.
-i Fail records where the original FILTER field is not 'PASS' or '.'
-b Filter by genotype flags. -n, -u and -l must be set.
-n Sample names ordered by category.
-u Number of samples in each category.
-l Requirement flags for each category. All samples that pass the specified filters
must meet the flag requirements, or the variant isn't reported. At least one
sample in each group must pass the specified filters, or the variant isn't
reported.
a) 'W' : homozygous common
b) 'H' : heterozygous
c) 'M' : homozygous rare
d) '-W' : not homozygous common
e) '-H' : not heterozygous
f) '-M' : not homozygous rare
-e Strict genotype matching. If this is selected, records with no-call samples
or samples falling below either minimum sample genotype quality (-g) or
minimum sample read depth (-r) won't be reported. Only samples listed in (-n)
will be checked.
-d Minimum record QUAL score, defaults to 0, recommend >=20 .
-g Minimum sample genotype quality GQ, defaults to 0, recommend >= 20 .
-r Minimum sample read depth DP, defaults to 0, recommend >=10 .
-s Print sample names and exit.
-t Path to tabix
Example: java -Xmx10G -jar pathTo/USeq/Apps/MultiSampleVCFFilter
-v DEMO.passing.vcf.gz -p DEMO.intersection.vcf.gz -b
-n SRR504516,SRR776598,SRR504515,SRR504517,SRR504483 -u 2,2,1 -l M,H,-M
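
How -n, -u and -l fit together can be sketched as follows. This is a simplified illustration rather than the app's logic: it assumes allele 0 is the "common" allele, treats no-calls as samples that fail the filters, and ignores the GQ/DP thresholds. All function names are hypothetical.

```python
def genotype_class(gt):
    """Map a VCF GT string to 'W', 'H', 'M', or None for a no-call."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None
    if all(a == "0" for a in alleles):
        return "W"  # homozygous common (assumption: 0 = common allele)
    if len(set(alleles)) == 1:
        return "M"  # homozygous rare
    return "H"      # heterozygous


def meets_flag(gt, flag):
    """Check one genotype against a flag such as 'H' or '-M'."""
    cls = genotype_class(gt)
    if cls is None:
        return False
    if flag.startswith("-"):
        return cls != flag[1:]
    return cls == flag


def split_categories(names, counts):
    """Split the ordered -n sample names into groups sized by -u."""
    groups, start = [], 0
    for count in counts:
        groups.append(names[start:start + count])
        start += count
    return groups


def record_matches(genotypes, groups, flags):
    """At least one called sample per group, and all called samples meet the flag."""
    for group, flag in zip(groups, flags):
        called = [s for s in group if genotype_class(genotypes[s]) is not None]
        if not called or not all(meets_flag(genotypes[s], flag) for s in called):
            return False
    return True


names = ["SRR504516", "SRR776598", "SRR504515", "SRR504517", "SRR504483"]
groups = split_categories(names, [2, 2, 1])
print(groups)
```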
**************************************************************************************
**************************************************************************************
** Novoalign Bisulfite Parser: Dec 2013 **
**************************************************************************************
Parses Novoalign -b2 and -b4 single and paired bisulfite sequence alignment files into
PointData file formats. Generates several summary statistics on converted and non-
converted C contexts. Flattens overlapping reads in a pair to call consensus bps.
Note: for paired read RNA-Seq data run through the SamTranscriptomeParser first.
Options:
-a Alignment file or directory containing novoalignments in SAM/BAM
(xxx.sam(.zip/.gz OK) or xxx.bam) format. Multiple files are merged.
-f Fasta file directory, chromosome specific xxx.fa/.fasta(.zip/.gz OK) files.
-s Save directory.
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Default Options:
-p Print bed file parsed data.
-x Maximum alignment score. Defaults to 300, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of read
is incorrect. For RNASeq data, set this to 0.
-b Minimum base quality score for reporting a non/converted C, defaults to 13.
-c Minimum base quality score for reporting an overlapping non/converted C not found
in the other pair, defaults to 13.
-d Remove duplicate reads prior to generating PointData. Defaults to not removing
duplicates.
Example: java -Xmx25G -jar pathToUSeq/Apps/NovoalignBisulfiteParser -x 240 -a
/Novo/Run7/ -f /Genomes/Hg19/Fastas/ -v H_sapiens_Feb_2009 -s /Novo/Run7/NBP
**************************************************************************************
**************************************************************************************
** Novoalign Indel Parser: June 2010 **
**************************************************************************************
Parses Novoalign alignment xxx.txt(.zip/.gz) files for consensus indels, something
currently not supported by the maq apps. Generates a consensus indel allele file,
interbase coordinates, for running through the Alleler application. Also creates two
bed files for the insertions and deletions.
Options:
-f The full path directory/file text of your Novoalign xxx.txt(.zip or .gz) file(s).
-r Full path directory for saving the results.
-p Minimum alignment posterior probability (-10Log10(prob)) of being incorrect,
defaults to 13 (0.05). Larger numbers are more stringent.
-b Minimum affected indel base quality score(s), ditto, defaults to 13.
-u Minimum number of unique reads covering indel, defaults to 2.
Example: java -Xmx1500M -jar pathToUSeq/Apps/NovoalignIndelParser -f /Novo/Run7/
-r /Novo/Run7/indelAlleleTable.txt -p 20 -b 20 -u 3
**************************************************************************************
**************************************************************************************
** Novoalign Parser: Jan 2011 **
**************************************************************************************
Parses Novoalign xxx.txt(.zip/.gz) files into center position binary PointData xxx.bar
files, xxx.bed files, and if appropriate, a splice junction bed file. For the latter,
create a gene regions bed file and run it through the MergeRegions application to
collapse overlapping transcripts. We recommend using the following settings while
running Novoalign 'novoalign -r0.2 -q5 -d yourDataBase -f your_prb.txt | grep '>chr' >
yourResultsFile.txt'. NP works with native, colorspace, and miRNA novoalignments.
Options:
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-f The full path directory/file text of your Novoalign xxx.txt(.zip or .gz) file(s).
-r Full path directory text for saving the results.
-p Posterior probability threshold (-10Log10(prob)) of being incorrect, defaults to 13
(0.05). Larger numbers are more stringent. The parsed scores are delogged and
converted to 1-prob.
-q Alignment score threshold, smaller numbers are more stringent, defaults to 60
-c Chromosome prefix, defaults to '>chr'.
-i Ignore strand when making splice junctions.
-g (Optional) Full path gene region bed file (chr start stop...) containing gene
regions to use in scaling intersecting splice junctions.
-s Just print alignment stats, don't save any data.
Example: java -Xmx1500M -jar pathToUSeq/Apps/NovoalignParser -f /Novo/Run7/
-v H_sapiens_Mar_2006 -p 20 -q 30 -r /Novo/Run7/mRNASeq/ -i -g
/Anno/Hg18/mergedUCSCKnownGenes.bed
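
The -p threshold is a phred-style score, -10Log10(prob), so the default of 13 corresponds to an error probability of roughly 0.05. A small sketch of the conversion in both directions, with hypothetical function names:

```python
import math


def phred_to_error_prob(score):
    """Convert a -10*log10(prob) score back to the error probability."""
    return 10 ** (-score / 10.0)


def error_prob_to_phred(prob):
    """Convert an error probability to a -10*log10(prob) score."""
    return -10.0 * math.log10(prob)


error = phred_to_error_prob(13)
print(error)        # probability the alignment is incorrect
print(1.0 - error)  # the "1-prob" form mentioned for the parsed scores
```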
**************************************************************************************
**************************************************************************************
** Novoalign Paired Parser: January 2009 **
**************************************************************************************
Parses Novoalign paired alignment files xxx.txt(.zip/.gz) into xxx.bed format.
Options:
-f The full path directory/file text of your Novoalign xxx.txt(.zip or .gz) file(s).
-e Exclude half matches with a high quality unmatched pair, defaults to keeping them.
-m Maximum size for paired reads mapping to the same chromosome, defaults to 100000.
-s Splice junction radius, defaults to 34. See the MakeSpliceJunctionFasta app.
Example: java -Xmx1500M -jar pathToUSeq/Apps/NovoalignPairedParser -f /Novo/Run7/
**************************************************************************************
**************************************************************************************
** Oligo Tiler: Oct 2009 **
**************************************************************************************
OT tiles oligos across genomic regions returning their forward and reverse sequences.
Won't tile oligos with non GATC characters, case insensitive. Replaces non GATC chars
in offset regions with 'a'. Note, the defaults are set for generating a 60 mer Agilent
specific tiling microarray design where the first 10bp of the 3' stop are buried in the
matrix and the effective oligo length is 50bp. Adjust accordingly for other platforms.
Options:
-f Fasta file directory, should contain chromosome specific xxx.fasta files.
-r Regions file to tile (tab delimited: chr start stop ...) interbase coordinates.
-o Effective oligo size, defaults to 50.
-s Spacing to place oligos, defaults to 25.
-t Three prime offset, defaults to 10.
-m Minimum size of region to tile, defaults to 20.
-a Print oligo FASTA instead of an Agilent eArray text seq formatted results.
-c Tile CpG (spacing not used, see max gap option).
-g Max gap between adjacent CpGs to include in same oligo, defaults to 8.
-e Split export files by strand instead of alternating strand.
-b Replace 3' stop of oligos with the human 11-nullomer 'ccgatacgtcg'. The first
~10bp don't contribute to hybridization on Agilent arrays.
Example: java -Xmx4000M -jar pathTo/Apps/OligoTiler -s 40 -f /Genomes/Hg18/Fastas/
-r /Designs/cancerArray.bed -p -a
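
A simplified sketch of tiling follows. It assumes that spacing is the distance between oligo starts and that the "reverse" sequence is the reverse complement. It ignores the 3' offset, CpG tiling, and the Agilent export format, and its function names are hypothetical.

```python
COMPLEMENT = str.maketrans("GATCgatc", "CTAGctag")


def reverse_complement(seq):
    """Reverse complement of a GATC-only sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_oligos(seq, start, stop, size=50, spacing=25):
    """Tile oligos across seq[start:stop], skipping any with non GATC characters."""
    oligos = []
    for pos in range(start, stop - size + 1, spacing):
        oligo = seq[pos:pos + size]
        if set(oligo.upper()) <= set("GATC"):
            oligos.append((pos, oligo, reverse_complement(oligo)))
    return oligos


print(tile_oligos("ACGTNACGTACGT", 0, 13, size=4, spacing=3))
```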
**************************************************************************************
**************************************************************************************
** Overdispersed Region Scan Seqs: May 2012 **
**************************************************************************************
WARNING: this application is deprecated and no longer maintained, use the
DefinedRegionDifferentialSeq app instead!
ORSS takes bam alignment files and extracts reads under each region or gene's exons to
calculate several statistics. Makes use of Simon Anders' DESeq R package with its
negative binomial p-value test to control for overdispersion. A Benjamini-Hochberg FDR
correction is used to control for multiple testing. DESeq is run with and without
variance outlier filtering. A chi-square test of independence between the exon read
count distributions is used to score alternative splicing. Several read count measures
are provided including counts for each replica, FPKMs (# frags per kb of int region
per total mill mapped reads) as well as DESeq's variance adjusted counts (use these for
clustering, correlation, and other distance type analysis). If replicas are provided
either the smallest all pair log2Ratio is reported (default) or the pseudomedian.
Several results files are written: two spread sheets containing all of the genes,
those that pass the thresholds, as well as egr, bed12, and useq region files for
visualization in genome browsers.
Required Options:
-s Save directory.
-t Treatment directory containing one xxx.bam file with xxx.bai index per biological
replica. The BAM files should be sorted by coordinate and have passed Picard
validation. Use the SamTranscriptomeParser to convert your aligned transcriptome
data to genomic coordinates.
-c Control directory, ditto.
-u UCSC RefFlat or RefSeq Gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds). WARNING!!!!!!
This table should contain only one composite transcript per gene. Use the
MergeUCSCGeneTable app to collapse Ensembl transcripts downloaded from UCSC in
RefFlat format.
-b (Or) a bed file (chr, start, stop,...), full path, See,
http://genome.ucsc.edu/FAQ/FAQformat#format1
-v Versioned Genome (ie H_sapiens_Mar_2006, D_rerio_Jul_2010), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Advanced/ Default Options:
-o Don't remove overlapping exons, defaults to filtering gene annotation for overlaps.
-i Score introns instead of exons.
-a Data is stranded. Only collect reads from the same strand as the annotation.
-f Minimum FDR threshold, defaults to 10 (-10Log10(FDR=0.1))
-l Minimum absolute log2 ratio threshold, defaults to 1 (2x)
-e Minimum number mapping reads per region, defaults to 20
-d Don't delete temp files used by DESeq
-p Use a pseudo median log2 ratio in place of the smallest all pair log2 ratios for
scoring the degree of differential expression when replicas are present.
Recommended for experiments with 4 or more replicas.
-r Full path to R loaded with DESeq library, defaults to '/usr/bin/R' file, see
http://www-huber.embl.de/users/anders/DESeq/ . Type 'library(DESeq)' in
an R terminal to see if it is installed.
Example: java -Xmx4G -jar pathTo/USeq/Apps/OverdispersedRegionScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults/ -f 30 -e 30 -u /Anno/mergedZv9EnsemblGenes.ucsc.gz
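
The FPKM definition and the -10Log10 scaling of the -f threshold can be illustrated with a short sketch. The function names are hypothetical and this is not the app's code.

```python
import math


def fpkm(fragments, region_length_bp, total_mapped_reads):
    """Fragments per kb of interrogated region per million mapped reads."""
    kilobases = region_length_bp / 1000.0
    millions = total_mapped_reads / 1_000_000.0
    return fragments / kilobases / millions


def fdr_to_threshold(fdr):
    """Express an FDR as -10*log10(FDR), the scale used by the -f option."""
    return -10.0 * math.log10(fdr)


print(fpkm(200, 2000, 10_000_000))
print(fdr_to_threshold(0.1))
```

On this scale, -f 30 in the example above corresponds to an FDR of 0.001.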
**************************************************************************************
**************************************************************************************
** Create Exon Summary Metrics : April 2013 **
**************************************************************************************
This script runs a bunch of summary metric programs and compiles the results. It uses
R and LaTeX to generate a fancy pdf as an output. Can also generate html.
Required:
-a Alignment statistics from Picard's CollectAlignmentMetrics
-b Alignment counts from USeq's CountChromosomes
-c Coverage of CCDS exons from USeq's Sam2USeq
-d Duplication statistics from Picard's MarkDuplicates
-e Error rate from USeq's CalculatePerCycleErrorRate
-f Overlap statistics from USeq's MergePairedSamAlignments
-o Output file name
Optional
-r Path to R
-l Path to pdflatex
-t Generate html instead
-i Generate dictionary (for pipeline)
-c Coverage file name
Example: java -Xmx1500M -jar pathTo/USeq/Apps/VCFAnnovar -v 9908R.vcf
**************************************************************************************
**************************************************************************************
** ParseIntersectingAlignments: June 2010 **
**************************************************************************************
Parses bed alignment files for intersecting reads provided another bed file of alleles.
Options:
-s Full path file text for your SNP allele six column bed file (tab delimited chr,
start,stop,text,score,strand)
-a Full path file text for your alignment bed file from the NovoalignParser.
-m Minimum base quality, defaults to 13
Example: java -Xmx1500M -jar pathToUSeq/Apps/ParseIntersectingAlignments
-s /LympAlleles/ex1.bed -a /SeqData/lymphAlignments.bed -m 13
**************************************************************************************
**************************************************************************************
** ParsePointDataContexts: Feb 2011 **
**************************************************************************************
Parses PointData for particular 5bp genomic sequence contexts.
Options:
-s Save directory, full path.
-p PointData directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. These will be merged before splitting by summing overlapping
position scores.
-f Fasta files for each chromosome.
-c Context java regular expression, must be 5bp long, 5'->3', case insensitive, e.g.:
'..CG.' for CG
'..C[CAT]G' for CHG
'..C[CAT][CAT]' for CHH
'..C[CAT].' for nonCG
'..C[^G].' for nonCG
Example: java -Xmx12G -jar pathTo/USeq/Apps/ParsePointDataContexts -c '..CG.' -s
/Data/PointData/CG -f /Genomes/Hg18/Fastas -p /Data/PointData/All/
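
The context expressions above are simple enough that they behave the same way in Python's `re` module, which makes them easy to try out. A small sketch that classifies the C at the center of a 5bp context follows. The dictionary and function name are hypothetical, and the app itself uses Java regular expressions.

```python
import re

# Patterns copied from the -c option description, 5'->3', C at index 2.
CONTEXTS = {
    "CG": "..CG.",
    "CHG": "..C[CAT]G",
    "CHH": "..C[CAT][CAT]",
}


def classify_context(five_mer):
    """Return the first context whose pattern matches the whole 5-mer, else None."""
    for name, pattern in CONTEXTS.items():
        if re.fullmatch(pattern, five_mer, flags=re.IGNORECASE):
            return name
    return None


for kmer in ["AACGT", "ttCAG", "GGCTA", "AAGGT"]:
    print(kmer, classify_context(kmer))
```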
**************************************************************************************
**************************************************************************************
** PeakShiftFinder: May 2010 **
**************************************************************************************
PeakShiftFinder estimates the bp difference between sense and antisense proximal chIP-
seq peaks. It calculates the shift in two ways: by generating a composite peak from a
set of the top peaks in a dataset and by taking the median shift for the top peaks.
The latter appears more reliable for some datasets. Inspect the results in IGB by
loading the xxx.bar graphs. When in doubt, run ScanSeqs with just your
treatment data setting the peak shift to 0 and window size to 50 and manually inspect
the shift in IGB.
Options:
-t Treatment Point Data directories, full path, comma delimited. These should
contain stranded chromosome specific xxx_+_.bar.zip and xxx_-_.bar.zip files.
-c Control Point Data directories, ditto.
-s Save directory, full path.
Advanced Options:
-e Two chIP samples are provided, no input, scan for reduced peaks too.
-w Window size in bps, defaults to 50.
-a Minimum number window reads, defaults to 10
-d Minimum normalized window score, defaults to 2.5
-r Minimum fold of treatment to control window reads, defaults to 5
-n Number of peaks to merge for composite, defaults to 100
-p Distance off peak center to collect from 5' stop, defaults to 500
-m Distance off peak center to collect from 3' stop, defaults to 1000
Example: java -Xmx1500M -jar pathTo/USeq/Apps/PeakShiftFinder -t
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -c /Data/Input1/,/Data/Input2/ -s
/Results/Ets1PeakShiftResults -w 25 -d 5
**************************************************************************************
**************************************************************************************
** Point Data Manipulator: Oct 2010 **
**************************************************************************************
Manipulates PointData to merge strands, shift base positions, replace scores with 1
and sum identical positions. If multiple PointData directories are given, the data is
merged.
Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files. Alternatively, provide one directory
containing multiple PointData directories.
-s Save directory, full path.
-o Replace PointData scores with 1
-d Shift base position XXX bases 3', defaults to 0
-i Sum identical base position scores
-m Merge strands
Example: java -Xmx1500M -jar pathTo/USeq/Apps/PointDataManipulator -p
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedEts1 -o -i -m
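
The effect of combining -o and -i can be sketched on a list of (position, score) pairs. This is an illustration only, with a hypothetical function name. It assumes the score replacement happens before the summing, and it leaves out strand merging and position shifting.

```python
from collections import defaultdict


def manipulate(points, replace_with_one=False, sum_identical=False):
    """Apply the -o and -i style operations to (position, score) pairs."""
    out = [(pos, 1.0 if replace_with_one else score) for pos, score in points]
    if sum_identical:
        sums = defaultdict(float)
        for pos, score in out:
            sums[pos] += score  # scores at the same base position are summed
        out = sorted(sums.items())
    return out


points = [(10, 2.5), (10, 1.0), (20, 4.0)]
print(manipulate(points, replace_with_one=True, sum_identical=True))
```

With both options set, each position ends up scored by the number of observations at that base.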
**************************************************************************************
**************************************************************************************
** Primer3 Wrapper: Dec 2006 **
**************************************************************************************
Wrapper for the primer3 application. Extracts sequence, formats for primer3, executes
and parses the output to a spreadsheet. See http://frodo.wi.mit.edu/primer3/
-f Full path file text for your sequence file, tab delimited, sequence in 1st column.
-s Pick small product sizes (45-80bp), defaults to standard (80-150bp)
-p Full path file text for the primer3_core application. Defaults to
/nfs/transcriptome/software/noarch/T2/64Bit_Primer3_1.0.0/src/primer3_core
-m Full path file text for the mispriming library. Defaults to
/nfs/transcriptome/software/noarch/T2/64Bit_Primer3_1.0.0/
cat_humrep_and_simple.cgi.txt
Example: java -jar pathTo/T2/Apps/Primer3Wrapper -f /home/dnix/seqForQPCR.txt -p
/nfs/transcriptome/software/noarch/T2/64Bit_Primer3_1.0.0/src/primer3_core
-m /nfs/transcriptome/software/noarch/T2/64Bit_Primer3_1.0.0/
cat_humrep_and_simple.cgi.txt -s
**************************************************************************************
**************************************************************************************
** Print Select Columns: Sept 2010 **
**************************************************************************************
Spread sheet manipulation.
Required Parameters:
-f Full path file or directory text for tab delimited text file(s)
-i Column indexes to print, comma delimited, no spaces
-n Number of initial lines to skip
-l Print only this last number of lines
-c Column word to append onto the start of each line
-r Append a row number column as the first column in the output
-d Append file text onto the start of each line
-s Skip blank lines and those with less than the indicated number of columns.
-a Print all available columns.
Example: java -jar pathTo/T2/PrintSelectColumns -f /TabFiles/ -i 0,3,9 -n 1 -c chr
**************************************************************************************
**************************************************************************************
** QCSeqs: Nov 2009 **
**************************************************************************************
QCSeqs takes directories of chromosome specific PointData xxx.bar.zip files that
represent replicas of signature sequencing data, merges the strands, uses a sliding
window to sum the hits, and calculates Pearson correlation coefficients for the window
sums between each pair of replicas. Only windows with a sum score >= the minimum
are included in the correlation.
-d Split chromosome Point Data directories, full path, comma delimited. (These should
contain chromosome specific xxx.bar.zip files).
-t Temp directory, full path. This will be created and then deleted.
-w Window size in bps, defaults to 500.
-s Window step size in bps, defaults to 250.
-m Minimum window sum score, defaults to 5.
-e (Optional) Provide a full path file name in which to write the window sums.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/QCSeqs -d /Solexa/PolII/Rep1PntData/,
/Solexa/PolII/Rep2PntData/ -t /Solexa/PolII/TempDelMe -w 1000 -s 250
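
A minimal sketch of the window sum and correlation steps is given below. The function names are hypothetical. Keeping a window when either replica reaches the minimum sum is an assumption, since the text above does not say whether one or both must pass.

```python
import math


def window_sums(hits, chrom_length, window=500, step=250):
    """Sum scores in sliding windows; hits maps position -> score."""
    sums = []
    for start in range(0, max(chrom_length - window, 0) + 1, step):
        sums.append(sum(s for p, s in hits.items() if start <= p < start + window))
    return sums


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one replica has no variance
    return cov / (sd_x * sd_y)


def replica_correlation(sums_a, sums_b, minimum=5):
    """Correlate window sums, keeping windows where either sum >= minimum."""
    kept = [(a, b) for a, b in zip(sums_a, sums_b) if a >= minimum or b >= minimum]
    return pearson([a for a, _ in kept], [b for _, b in kept])


print(window_sums({10: 1, 260: 2, 600: 3}, 1000))
```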
**************************************************************************************
**************************************************************************************
** Qseq2Fastq: Aug 2010 **
**************************************************************************************
Parses, filters out reads failing QC, compresses and converts single and paired read
qseq files to Illumina fastq format. Does not concatenate tiles.
Required Parameters:
-q Qseq directory. This should contain all of the qseq files for a sequencing run,
multiple lanes, paired and single reads. (e.g. s_5_1_0025_qseq.txt(.gz/zip OK))
Optional Parameters:
-f Fastq save directory. Defaults to the qseq directory.
-a Keep all reads. Defaults to removing those failing the QC flag. Paired reads are
only removed if both reads fail QC.
-p Print full fastq headers. Defaults to using read count.
-d Delete qseq files upon successful parsing of all files. Be careful!
-s Silence non error output.
Example: java -Xmx2G -jar pathTo/USeq/Apps/Qseq2Fastq -f /Runs/7/Fastq -q
/Runs/100726_SN141_0265_A207D4ABXX/Data/Intensities/BaseCalls
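
The conversion of a single qseq record can be sketched as below. It assumes the usual 11 column tab delimited qseq layout (machine, run, lane, tile, x, y, index, read, sequence, quality, filter) with '.' marking an uncalled base. The function name and header layout are illustrative, and the paired read QC rule is not modeled.

```python
def qseq_to_fastq(line, keep_all=False):
    """Convert one qseq line to a fastq record, or None if it fails QC."""
    fields = line.rstrip("\n").split("\t")
    machine, run, lane, tile, x, y, index, read, seq, qual, passed = fields[:11]
    if passed != "1" and not keep_all:
        return None  # read failed the QC filter flag
    header = f"@{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read}"
    # qseq uses '.' for uncalled bases; fastq uses 'N'.
    return f"{header}\n{seq.replace('.', 'N')}\n+\n{qual}\n"


record = qseq_to_fastq("M1\t7\t5\t25\t100\t200\t0\t1\tAC.T\tabcd\t1")
print(record, end="")
```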
**************************************************************************************
**************************************************************************************
** Randomize Text File: May 2013 **
**************************************************************************************
Randomizes the lines of a text file(s).
Options:
-f Full path to a text file or directory containing such to randomize. Gzip/zip OK.
-n Number of lines to print, defaults to all.
Example: java -Xmx4G -jar pathTo/Apps/RandomizeTextFile -n 24560 -f
/TilingDesign/oligos.txt.gz
**************************************************************************************
**************************************************************************************
** Ranked Set Analysis: Jan 2006 **
**************************************************************************************
RSA performs set analysis (intersection, union, difference) on lists of
genomic regions (tab delimited: chrom, start, stop, score, (optional notes)).
-a Full path file text for the first list of genomic regions.
-b Full path file text for the second list of genomic regions.
-d (Optional) Full path directory containing region files for all pair analysis.
-m Max gap, bps, set negative to force an overlap, defaults to -100
-s Save comparison as a PNG, default is no.
Example: java -jar pathTo/T2/Apps/RankedSetAnalysis -a /affy/nonAmpA.txt -b
/affy/nonAmpB.txt -s
**************************************************************************************
**************************************************************************************
** Read Coverage: Feb 2012 **
**************************************************************************************
Generates read coverage stair-step xxx.bar graph files for visualization in IGB. Will
also calculate per base coverage stats for a given file of interrogated regions and
create a bed file of regions with low coverage based on the minimum number of reads.
By default, graph values are scaled per million mapped reads.
Options:
-p Point Data directories, full path, comma delimited. Should contain chromosome
specific xxx.bar.zip or xxx_-_.bar files. Can also provide one dir containing
PointData dirs.
-s Save directory, full path.
-k Data is stranded, defaults to merging strands while generating graphs.
-a Data contains hit counts due to running it through the MergePointData app.
-r Don't scale graph values. Leave as actual read counts.
-i (Optional) Full path file text for a tab delimited bed file (chr start stop ...)
containing interrogated regions to use in calculating per base coverage
statistics. Interbase coordinates assumed.
-m Minimum number reads for defining good coverage, defaults to 8. Use this in combo
with the interrogated regions file to identify poor coverage regions.
-b Just calculate stats, skip coverage graph generation.
-l Plus scalar, for stranded RC output, defaults to # plus observations/1000000
-n Minus scalar, for stranded RC output, defaults to # minus observations/1000000
-c Combined scalar, defaults to # observations/1000000
Example: java -Xmx1500M -jar pathTo/USeq/Apps/ReadCoverage -p
/Data/Ets1Rep1/,/Data/Ets1Rep2/ -s /Data/MergedHitTrckEts1 -i
/CapSeqDesign/interrogatedExonsChrX.bed
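A minimal Python sketch of the default scaling (raw count divided by # observations/1000000) and of flagging low coverage bases with a minimum read threshold. The function names are hypothetical, and treating a base as poorly covered when its raw depth is below the minimum is an assumption about how -m is applied.

```python
def scale_per_million(depths, total_observations):
    """Divide raw per base read counts by (total observations / 1,000,000)."""
    scalar = total_observations / 1_000_000
    return [depth / scalar for depth in depths]


def low_coverage_bases(depths, min_reads=8):
    """Indexes of bases whose raw depth is below min_reads (assumed rule)."""
    return [i for i, depth in enumerate(depths) if depth < min_reads]


raw = [9, 3, 8, 0]
print(scale_per_million(raw, 2_000_000))
print(low_coverage_bases(raw))
```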
**************************************************************************************
**************************************************************************************
** Reference Mutator : Aug 2014 **
**************************************************************************************
Takes a directory of fasta chromosome sequence files and converts the reference allele
to the alternate provided by a snp mapping table.
Required:
-f Full path to a directory containing chromosome specific fasta files. zip/gz OK.
-t Full path to a snp mapping table.
-s Full path to a directory to save the alternate fasta files.
Example: java -Xmx10G -jar pathTo/USeq/Apps/ReferenceMutator -f /Hg19/Fastas
-s /Hg19/AltFastas/ -t /Hg19/omni2.5SnpMap.txt
**************************************************************************************
**************************************************************************************
** RNA Editing PileUp Parser: June 2013 **
**************************************************************************************
Parses a SAMTools mpileup output file for refseq A bases that show evidence of
RNA editing via conversion to Gs, stranded. Base fraction editing is calculated for
bases passing the thresholds for viewing in IGB and subsequent clustering with
the RNAEditingScanSeqs app. The parsed PointData can be further processed using the
methylome analysis applications.
Options:
-p Path to a mpileup file (.gz or .zip OK, use 'samtools mpileup -Q 13 -A -B' params).
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-s Save directory, full path, defaults to pileup file directory.
-r Minimum read coverage, defaults to 5.
-t Generate strand specific reference calls, defaults to non stranded. Required for
stranded downstream analysis.
-m Skip processing chrM.
Example: java -Xmx4G -jar pathTo/USeq/Apps/RNAEditingPileUpParser -t -p
/Pileups/N2.mpileup.gz -v C_elegans_Oct_2010
**************************************************************************************
**************************************************************************************
** RNA Editing Scan Seqs: April 2014 **
**************************************************************************************
RESS attempts to identify clustered editing sites across a genome using a sliding
window approach. Each window is scored for the pseudomedian of the base fraction
edits as well as the probability that the observations occurred by chance using a
permutation test based on the chiSquare goodness of fit statistic.
Options:
-s Save directory, full path.
-e Edited PointData directory from the RNAEditingPileUpParser.
These should contain stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories. These will be merged when scanning.
-r Reference PointData directory from the RNAEditingPileUpParser. Ditto.
Advanced Options:
-a Minimum base read coverage, defaults to 5.
-b Minimum base fraction edited to use in analysis, defaults to 0.01
-w Window size, defaults to 50.
-p Minimum window pseudomedian, defaults to 0.005.
-m Minimum number observations in window, defaults to 3.
-t Run a stranded analysis, defaults to non-stranded.
-i Remove base fraction edits that are non zero and represented by just one edited
base.
Example: java -Xmx4G -jar pathTo/USeq/Apps/RNAEditingScanSeqs -s /Results/RESS -p 0.01
-e /PointData/Edited -r /PointData/Reference
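For reference, the pseudomedian (Hodges-Lehmann estimator) is commonly defined as the median of all pairwise averages of the observations. The Python sketch below uses that textbook definition, including each value paired with itself; whether RESS computes it in exactly this way is an assumption.

```python
from itertools import combinations_with_replacement
from statistics import median


def pseudomedian(values):
    """Median of the Walsh averages (x_i + x_j) / 2 for all i <= j."""
    walsh = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(walsh)


# Three observations in a window: the single large value pulls the
# pseudomedian above the plain median (2) but far less than it pulls the mean (4).
print(pseudomedian([1, 2, 9]))
```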
**************************************************************************************
**************************************************************************************
** RNASeq: May 2014 **
**************************************************************************************
The RNASeq application is a wrapper for processing RNA-Seq data through a variety of
USeq applications. It uses the DESeq2 package for calling significant differential
expression. 3-4 biological replicas per condition are strongly recommended. See
http://useq.sourceforge.net/usageRNASeq.html for details on constructing splice indexes,
aligning your reads, and building a proper gene (NOT transcript) table.
The pipeline:
1) Converts raw sam alignments containing splice junction coordinates into genome
coordinates outputting sorted bam alignments.
2) Makes relative read depth coverage tracks.
3) Scores known genes for differential exonic and intronic expression using DESeq2
and alternative splicing with a chi-square test.
4) Identifies unannotated differentially expressed transfrags using a window
scan and DESeq2.
Use this application as a starting point in your transcriptome analysis.
Options:
-s Save directory, full path.
-t Treatment alignment file directory, full path. Contained within should be one
directory per biological replica, each containing one or more raw
SAM (.gz/.zip OK) files.
-c Control alignment file directory, ditto.
-n Data is stranded. Only analyze reads from the same strand as the annotation.
-j Reverse stranded analysis. Only count reads from the opposite strand of the
annotation. This setting should be used for Illumina's strand-specific dUTP protocol.
-k Flip the strand of the second read pair.
-b Reverse the strand of both pairs. Use this option if you would like the orientation
of the alignments to match the orientation of the annotation in Illumina stranded
dUTP sequencing.
-x Max per base alignment depth, defaults to 50000. Genes containing such high
density coverage are ignored.
-v Genome version (e.g. H_sapiens_Feb_2009, M_musculus_Jul_2007), see UCSC FAQ,
http://genome.ucsc.edu/FAQ/FAQreleases.
-g UCSC RefFlat or RefSeq gene table file, full path. Tab delimited, see RefSeq Genes
http://genome.ucsc.edu/cgi-bin/hgTables, (uniqueName1 name2(optional) chrom
strand txStart txEnd cdsStart cdsEnd exonCount (commaDelimited)exonStarts
(commaDelimited)exonEnds). Example: ENSG00000183888 C1orf64 chr1 + 16203317
16207889 16203385 16205428 2 16203317,16205000 16203467,16207889 . NOTE:
this table should contain only ONE composite transcript per gene (e.g. use
Ensembl genes NOT transcripts). Use the MergeUCSCGeneTable app to collapse
transcripts to genes. See the RNASeq usage guide for details.
-r Full path to R, defaults to '/usr/bin/R'. Be sure to install DESeq2, gplots, and
qvalue Bioconductor packages.
Advanced Options:
-m Combine replicas and run single replica analysis using binomial based statistics,
defaults to DESeq and a negative binomial test.
-a Maximum alignment score. Defaults to 120, smaller numbers are more stringent.
-o Don't delete overlapping exons from the gene table.
-e Print verbose output from each application.
-p Run SAMseq in place of DESeq. This is suggested when you have five or more
replicates in each condition, and not suggested if you have fewer. Note
that it can't be run if you don't have at least two replicates per condition.
Example: java -Xmx2G -jar pathTo/USeq/Apps/RNASeq -v D_rerio_Dec_2008 -t
/Data/PolIIMut/ -c /Data/PolIIWT/ -s
/Data/Results/MutVsWT -g /Anno/zv8Genes.ucsc
**************************************************************************************
**************************************************************************************
** RNA Seq Simulator: Aug 2011 **
**************************************************************************************
RSS takes SAM alignment files from RNA-Seq data and simulates overdispersed, multiple
replica, differential, non-stranded RNA-Seq datasets.
Options:
-u UCSC RefFlat or RefSeq gene table file, full path. See,
http://genome.ucsc.edu/cgi-bin/hgTables, (name1 name2(optional) chrom strand
txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds)
-p PointData directories, full path, comma delimited. These should contain parsed
PointData (chromosome specific xxx_-/+_.bar.zip files) from running the
NovoalignParser on all of your novoaligned RNA-Seq data.
-n A full path directory name containing 3 or 4 equally split, randomized alignment
xxx.sam (.zip or .gz) files. One for each replica you wish to simulate. Use the
RandomizeTextFile and FileSplitter apps to generate these.
Default Options:
-g Number of genes to make differentially expressed, defaults to 500
-r Minimum number of mapped reads to include a gene in the differential expression
defaults to 50.
-a Smallest skew factor for differential expression, defaults to 0.2
-b Largest skew factor for differential expression, defaults to 0.8
-c Smallest excluded skew factor for differential expression and for overdispersion,
defaults to 0.45
-d Largest excluded skew factor for differential expression and for overdispersion,
defaults to 0.55
-o Don't overdisperse datasets, defaults to overdispersing data using -c and -d params.
-s Skip intersecting genes.
Example: java -Xmx12G -jar pathTo/USeq/Apps/RNASeqSimulator -u
/anno/hg19RefFlatKnownGenes.ucsc.txt -p /Data/Heart/MergedPointData/ -n
/Data/Heart/SplitSAM/ -s 46 -r 15 -g 1000
**************************************************************************************
**************************************************************************************
** Sam 2 Fastq: March 2012 **
**************************************************************************************
Extracts the original Illumina fastq data from single or paired end sam alignments.
Assumes alignments and reads are in the same order. In novoalign, set -oSync .
Options:
-a Sam alignment txt file, full path, .gz/.zip OK.
-f First read fastq file, ditto.
-s (Optional) Second read fastq file, from paired read sequencing, ditto.
Example: java -Xmx1G -jar pathToUSeq/Apps/Sam2Fastq -a /SAM/unaligned.sam.gz -f
/Fastq/X1_110825_SN141_0377_AD06YNACXX_1_1.txt.gz -s
/Fastq/X1_110825_SN141_0377_AD06YNACXX_1_2.txt.gz
**************************************************************************************
**************************************************************************************
** Sam 2 USeq : May 2014 **
**************************************************************************************
Generates per base read depth stair-step graph files for genome browser visualization.
By default, values are scaled per million mapped reads with no score thresholding. Can
also generate a list of regions that pass a minimum coverage depth.
Required Options:
-f Full path to a bam or a sam file (xxx.sam(.gz/.zip OK) or xxx.bam) or directory
containing such. Multiple files are merged.
-v Versioned Genome (ie H_sapiens_Mar_2006, D_rerio_Jul_2010), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
Default Options:
-s Generate strand specific coverage graphs.
-m Minimum mapping quality score. Defaults to 0, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect.
-a Maximum alignment score. Defaults to 1000, smaller numbers are more stringent.
-r Don't scale graph values. Leave as actual read counts.
-e Scale repeat alignments by dividing the alignment count at a given base by the
total number of genome wide alignments for that read. Repeat alignments are
thus given fractional count values at a given location. Requires that the IH
tag was set.
-b Path to a region bed file (tab delim: chr start stop ...) to use in calculating
read coverage statistics. Be sure these do not overlap! Run the MergeRegions app
if in doubt.
-p Path to a file for saving per region coverage stats. Defaults to variant of -b.
-c Print regions that meet a minimum # counts, defaults to 0, don't print.
-l Print regions that also meet a minimum length, defaults to 0.
-o Path to log file. Write coverage statistics to a log file instead of stdout.
-k Make average alignment length graph instead of read depth.
Example: java -Xmx1500M -jar pathTo/USeq/Apps/Sam2USeq -f /Data/SamFiles/ -r
-v H_sapiens_Feb_2009 -b ccdsExons.bed.gz
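A rough Python sketch of the -e idea: each alignment adds 1/IH to every base it covers, so a read reported at two locations contributes 0.5 at each. The tuple layout and function name are hypothetical, and interbase (stop excluded) coordinates are assumed; this is not the Sam2USeq implementation.

```python
def fractional_depth(alignments, length):
    """Per base depth where each alignment is weighted by 1 / IH.

    alignments: iterable of (start, stop, ih) with interbase coordinates,
    where ih is the number of alignments reported for that read (IH tag).
    """
    depth = [0.0] * length
    for start, stop, ih in alignments:
        weight = 1.0 / ih
        for pos in range(start, stop):
            depth[pos] += weight
    return depth


# One unique read over bases 0-2 and one two-location read over bases 1-3.
print(fractional_depth([(0, 3, 1), (1, 4, 2)], 5))
```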
**************************************************************************************
**************************************************************************************
** Sam Alignment Extractor: Jan 2013 **
**************************************************************************************
Given a bed file containing regions of interest, parses all of the intersecting sam
alignments.
Options:
-a Alignment directory containing one or more xxx.bam files with their associated
xxx.bai indexes sorted by coordinate.
-b A bed file (chr, start, stop,...), full path, see,
http://genome.ucsc.edu/FAQ/FAQformat#format1
-s Optional File for saving extracted alignments, must end in .sam. Defaults to a
permutation of the bed file.
-i Minimum read depth, defaults to 1
-x Maximum read depth, defaults to unlimited
Example: java -Xmx4G -jar pathTo/USeq/Apps/SamAlignmentExtractor -a
/Data/ExonCaptureAlignmentsX1/ -b /Data/SNPCalls/9484X1Calls.bed.gz -s
/Data/9484X1Calls.sam
**************************************************************************************
**************************************************************************************
** Sam Comparator : July 2014 **
**************************************************************************************
Compares coordinate sorted, unique, alignment sam/bam files. Splits alignments into
those that match chrom and position or mismatch.
Required:
-a Full path sam/bam file name. zip/gz OK.
-b Full path sam/bam file name. zip/gz OK.
-s Full path to a directory to save the results.
-p Print paired mismatches to screen.
Example: java -Xmx10G -jar pathTo/USeq/Apps/SamComparator -a /hg19/ref.sam.gz
-b /hg19/alt.sam.gz -s /hg19/SplitAlignments/
**************************************************************************************
**************************************************************************************
** Sam Parser: June 2013 **
**************************************************************************************
Parses SAM and BAM files into alignment center position PointData xxx.bar files.
For RNASeq data, first run the SamTranscriptomeParser to convert splice junction
coordinates to genomic coordinates and set -m to 0 below.
Options:
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-f The full path file or directory containing xxx.sam(.gz/.zip OK) or xxx.bam file(s).
Multiple files will be merged.
-r Full path directory for saving the results.
-m Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect. For RNA-Seq data from the SamTranscriptomeParser, set this to 0.
-a Maximum alignment score. Defaults to 60, smaller numbers are more stringent.
Example: java -Xmx1500M -jar pathToUSeq/Apps/SamParser -f /Novo/Run7/
-v C_elegans_May_2008 -m 0 -a 120
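The phred scaling used for -m follows the SAM convention MAPQ = -10 * log10(P), where P is the probability that the mapping position is wrong. A small Python sketch of the conversion; the function name is illustrative and not part of USeq.

```python
def mapq_to_error_probability(mapq):
    """Convert a phred-scaled mapping quality to P(mapping position is wrong)."""
    return 10 ** (-mapq / 10)


# The default of 13 corresponds to roughly a 5% chance of a wrong position;
# 0 applies no filtering because it corresponds to a probability of 1.
for mapq in (0, 13, 20):
    print(mapq, round(mapq_to_error_probability(mapq), 4))
```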
**************************************************************************************
**************************************************************************************
** Sam Transcriptome Parser: Dec 2013 **
**************************************************************************************
STP takes SAM alignment files that were aligned against chromosomes and extended
splice junctions (see MakeTranscriptome app), converts the coordinates to genomic
space and sorts and saves the alignments in BAM format. Although alignments don't need
to be sorted by chromosome and position, it is assumed all the alignments for a given
fragment are grouped together.
Options:
-f The full path file or directory containing raw xxx.sam(.gz/.zip OK) file(s).
Multiple files will be merged.
Default Options:
-s Save file, defaults to that inferred by -f. If an xxx.sam extension is provided,
the alignments won't be sorted by coordinate or saved as a bam file.
-a Maximum alignment score. Defaults to 90, smaller numbers are more stringent.
Approx 30pts per mismatch.
-m Minimum mapping quality score, defaults to 0 (no filtering), larger numbers are
more stringent. Only applies to genomic matches, not splice junctions. Set to 13
or more to require near unique alignments.
-x Maximum mapping quality, reset reads with a mapping quality greater than the max to
this max.
-n Maximum number of locations each read may align, defaults to 1 (unique matches).
-d If the maximum number of locations threshold fails, save one randomly picked repeat
alignment per read.
-r Reverse the strand of the second paired alignment. Reversing the strand is
needed for proper same strand visualization of paired stranded Illumina data.
-b Reverse the strand of both pairs. Use this option if you would like the orientation
of the alignments to match the orientation of the annotation in Illumina stranded
dUTP sequencing.
-u Save unmapped reads and those that fail the alignment score.
-c Don't remove chrAdapt and chrPhiX alignments.
-j Only print splice junction alignments, defaults to all.
-p Merge proper paired unique alignments. Those that cannot be unambiguously merged
are left as pairs. Recommended to avoid double counting errors and increase
base calling accuracy. For paired Illumina dUTP data, use -p -r -b .
-q Maximum acceptable base pair distance for merging, defaults to 300000.
-h Full path to a txt file containing a sam header, defaults to autogenerating the
header from the read data.
Example: java -Xmx1500M -jar pathToUSeq/Apps/SamTranscriptomeParser -f /Novo/Run7/
-m 20 -s /Novo/STPParsedBams/run7.bam -p -r
**************************************************************************************
**************************************************************************************
** Sam Fixer: August 2011 **
**************************************************************************************
Parses, filters, merges, and fixes xxx.sam files.
Options:
-f The full path file or directory containing xxx.sam(.gz/.zip OK) file(s). Multiple
files will be merged.
-s Full path file name for saving the fixed sam file.
Default Options:
-m Minimum mapping quality score. Defaults to 0, bigger numbers are more stringent.
This is a phred-scaled posterior probability that the mapping position of the read
is incorrect.
-a Maximum alignment score. Defaults to 1000, smaller numbers are more stringent.
-d Don't strip optional MD fields from alignments, defaults to removing these.
-u Remove unmapped reads.
-q Don't remove poor quality reads.
-c Convert splice-junctions to genomic coordinates, by providing a splice junction
radius. Only works for single read RNA-Seq data where a splice junction fasta
file was included in the alignments from the USeq MakeSpliceJunctionFasta app.
This does NOT work for paired RNA-Seq data.
Example: java -Xmx1500M -jar pathToUSeq/Apps/SamFixer -f /Novo/Run7/
-m 20 -a 120 -s /Novo/Run7/mergedFixed.sam -c 46 -u
**************************************************************************************
**************************************************************************************
** SamReadDepthSubSampler: Feb 2014 **
**************************************************************************************
Filters, randomizes, subsamples each coordinate sorted bam alignment file to a target
base level read depth. Useful for reducing extreme read depths over localized areas.
Options:
-a Alignment file or directory containing coordinate sorted xxx.bam files. Each is
processed independently.
-t Target read depth.
Default Options:
-p Keep read groups together. Causes greater variation in depth.
-x Maximum alignment score. Defaults to 300, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
For RNASeq data, set this to 0.
Example: java -Xmx25G -jar pathToUSeq/Apps/SamReadDepthSubSampler -x 240 -q 20 -a
/Novo/Run7/ -t 100
**************************************************************************************
**************************************************************************************
** Sam SV Filter: March 2014 **
**************************************************************************************
Filters SAM records based on their intersection with a list of target regions for
structural variation analysis. Paired alignments are kept if they align to at least
one target region. These are split into those that align to different targets (span),
the same target with sufficient softmasking (soft), or one target and somewhere else
(single).
Options:
-a Alignment file or directory containing NAME sorted SAM/BAM files. Multiple files
are processed independently. Xxx.sam(.gz/.zip) or xxx.bam are OK. Assumes only
uniquely aligned reads. Remove duplicates with Picard's MarkDuplicates app.
-s Save directory for the results.
-b Bed file (tab delim: chr, start, stop, ...) of target regions interbase coordinates.
Default Options:
-n Mark passing alignments as secondary. Needed for Delly with -n 30 novoalignments.
-d Don't coordinate sort and index alignments.
-x Maximum alignment score. Defaults to 1000, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 5, bigger numbers are more stringent.
-c Chromosomes to skip, defaults to 'chrAdap,chrPhi,chrM,random,chrUn'. Any SAM
record chromosome name that contains one will be failed.
-m Minimum soft masked bases for keeping paired alignments intersecting the same
target, defaults to 10
Example: java -Xmx25G -jar pathTo/USeq_xxx/Apps/SamSVFilter -x 150 -q 13 -a
/Novo/Run7/ -s /Novo/Run7/SSVF/ -c 'chrPhi,_random,chrUn_'
**************************************************************************************
**************************************************************************************
** SamSubsampler: July 2013 **
**************************************************************************************
Filters, randomizes, subsamples and sorts sam/bam alignment files.
Options:
-a Alignment file or directory containing SAM/BAM (xxx.sam(.zip/.gz OK) or xxx.bam).
Multiple files are merged.
-r Results directory.
Default Options:
-n Number of alignments to print, defaults to all passing thresholds.
-s Sort and index output alignments.
-x Maximum alignment score. Defaults to 300, smaller numbers are more stringent.
-q Minimum mapping quality score. Defaults to 13, bigger numbers are more stringent.
For RNASeq data, set this to 0.
Example: java -Xmx25G -jar pathToUSeq/Apps/SamSubsampler -x 240 -q 20 -a
/Novo/Run7/ -r /Novo/Run7/SR -n 10000000
**************************************************************************************
**************************************************************************************
** Scan Seqs: Feb 2012 **
**************************************************************************************
Takes unshifted stranded chromosome specific PointData and uses a sliding window to
calculate several smoothed window statistics. These include a binomial p-value, a
q-value FDR, an empirical FDR, and a Bonferroni corrected binomial p-value for peak
shift strand skew. These are saved as heat map/ stairstep xxx.bar graph files for
direct viewing in the Integrated Genome Browser. The empFDR is only calculated when
scanning for enriched regions. Provide >2x the # of control reads relative to
treatment to prevent significant sub sampling when calculating the empFDR. If control
data is not provided, simple window sums are calculated.
Options:
-s Save directory, full path.
-t Treatment PointData directories, full path, comma delimited. These should
contain unshifted stranded chromosome specific xxx_-/+_.bar.zip files. One
can also provide a single directory that contains multiple PointData
directories.
-c Control PointData directories, ditto.
-p Peak shift, see the PeakShiftFinder app. Average distance between + and - strand
peaks. Will be used to shift the PointData and set the window size.
-r Full path to R loaded with Storey's q-value library, defaults to '/usr/bin/R'
file, see http://genomics.princeton.edu/storeylab/qvalue/
Advanced Options:
-w Window size, defaults to peak shift. A good alternative window size is the
peak shift plus the standard deviation, see the PeakShiftFinder app.
-e Scan for both reduced and enriched regions, defaults to look for only enriched
regions. This turns off the empFDR estimation.
-j Scan only one strand, defaults to both, enter either + or -
-q Don't filter windows using q-value FDR threshold, save all to bar graphs,
defaults to saving those with a q-value < 40%.
-m Minimum number reads in window, defaults to 2. Increasing this threshold will
speed up processing considerably but compromises the q-value estimation.
-f Filter windows with high control read counts. Don't use if looking for
reduced regions.
-g Control window read count threshold, # stnd devs off median, defaults to 4.
-n Print point graph window representation xxx.bar files.
-a Number treatment observations to use in defining expect and ratio scalars.
-b Number control observations to use in defining expect and ratio scalars.
-u Use read score probabilities (assumes scores are > 0 and <= 1), defaults to
assigning 1 to each read score. Experimental.
Example: java -Xmx4G -jar pathTo/USeq/Apps/ScanSeqs -t
/Data/PolIIRep1/,/Data/PolIIRep2/ -c /Data/Input1/,/Data/Input2/ -s
/Data/PolIIResults -w 200 -p 100 -f -g 5
**************************************************************************************
**************************************************************************************
** Shift Annotation Positions: Oct 2010 **
**************************************************************************************
Uses the information in an xxx.shifter.txt file from the ConcatinateFastas app to
shift the annotation to match the coordinates of the concatenated sequence. Good for
working with poorly assembled genomes. Run this multiple times with different shifter
files. All files are assumed to use interbase coordinates.
Options:
-b Full path file name for a xxx.bed formatted annotation file.
-u (OR) Full path file name for a UCSC refflat/ refseq formatted gene table.
-s Full path file name for the xxx.shifter.txt file from the ConcatinateFastas app.
Example: java -Xmx4G -jar pathTo/USeq/Apps/ShiftAnnotationPositions
-u /zv8/ucscRefSeq.txt -s /zv8/BadFastas/chrScaffold.shifter.txt
**************************************************************************************
**************************************************************************************
** SoapV1Parser: Feb 2009 **
**************************************************************************************
Splits and converts Soap version 1 alignment xxx.txt files into center position binary
PointData xxx.bar files. Interbase coordinates (zero based, stop excluded).
These can be directly viewed in IGB.
-v Versioned Genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases.
-f The full path directory/file text of your Soap xxx.txt(.zip or .gz) file(s).
-r Full path directory text for saving the results.
-x Maximum number of best matches, defaults to 1.
-m Minimum read length, defaults to 17.
-s Sum identical PointData positions. This should not be used for any downstream USeq
applications, only for visualization.
-p Make read length histogram on reads that pass filters, defaults to all.
Example: java -Xmx1500M -jar pathToUSeq/Apps/SoapV1Parser -f /Soap/Run7/
-v H_sapiens_Mar_2006 -x 5 -m 20
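Interbase coordinates (zero based, stop excluded) differ from the one based, stop included positions many alignment formats report. The Python sketch below shows the conversion and one possible center position calculation. Both function names are hypothetical, and rounding the center down is an assumption rather than documented parser behavior.

```python
def to_interbase(start_1based, stop_1based_inclusive):
    """Convert a 1-based, stop-included range to interbase (0-based, stop excluded)."""
    return start_1based - 1, stop_1based_inclusive


def center_position(start, stop):
    """Midpoint of an interbase range, rounded down (assumed rounding)."""
    return (start + stop) // 2


# A 26 bp read covering bases 1..26 becomes 0..26 in interbase coordinates.
start, stop = to_interbase(1, 26)
print(start, stop, stop - start, center_position(start, stop))
```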
**************************************************************************************
**************************************************************************************
** Subtract Regions: May 2009 **
**************************************************************************************
Removes regions and parts thereof that intersect the masking region file. Provide
tab delimited bed files (chr start stop ...). Assumes interbase coordinates.
Options:
-m Bed file to use in subtracting/ masking.
-d Directory containing bed files to mask.
Example: java -Xmx4000M -jar pathTo/Apps/SubtractRegions -d /Anno/TilingDesign/
-m /Anno/repeatMaskerHg18.bed
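A minimal Python sketch of subtracting mask regions from a single region on one chromosome, using interbase coordinates. It illustrates the general technique only; the function name and tuple layout are hypothetical and this is not the SubtractRegions source.

```python
def subtract_regions(region, masks):
    """Return the parts of region (start, stop) not covered by any mask."""
    start, stop = region
    kept = []
    cursor = start  # first base not yet known to be kept or masked
    for m_start, m_stop in sorted(masks):
        if m_stop <= cursor:
            continue  # mask ends before the uncovered part begins
        if m_start >= stop:
            break  # this and all later masks start past the region
        if m_start > cursor:
            kept.append((cursor, m_start))
        cursor = max(cursor, m_stop)
        if cursor >= stop:
            break
    if cursor < stop:
        kept.append((cursor, stop))
    return kept


print(subtract_regions((100, 200), [(120, 130), (150, 160)]))
```

A region that is fully covered by a mask yields an empty list, so it would be dropped entirely.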
************************************************************************************
**************************************************************************************
** Score Chromosomes: Oct 2012 **
**************************************************************************************
SC scores chromosomes for the presence of transcription factor binding sites. Use the
following options:
-g The full path directory text to the split genomic sequences (i.e. chr2L.fasta,
chr3R.fasta...), FASTA format.
-t Full path file text for the FASTA file containing aligned trimmed examples of
transcription factor binding sites. A log likelihood position specific
probability matrix will be generated from these sequences and used to scan the
chromosomes for hits to the matrix.
-s Score cut off for the matrix. Defaults to the score of the lowest scoring sequence
used in making the LLPSPM.
-p Print hits to screen, default is no.
-v Provide a versioned genome (ie H_sapiens_Mar_2006), see UCSC Browser,
http://genome.ucsc.edu/FAQ/FAQreleases, if you would like to write graph LLPSPM
scores in xxx.bar format for direct viewing in IGB.
Example: java -Xmx4000M -jar pathTo/T2/Apps/ScoreChromosomes -g /my/affy/Hg18Seqs/ -t
/my/affy/fgf8.fasta -s 4.9 -v H_sapiens_Mar_2006
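A simplified Python sketch of building a log likelihood position specific probability matrix from aligned sites and scanning a sequence with it. The pseudocount of 1, the uniform 0.25 background, log base 2, and forward strand only scanning are assumptions made for illustration; ScoreChromosomes may differ on each point. The default cutoff mirrors the description of -s above (score of the lowest scoring training sequence).

```python
import math

BASES = "ACGT"


def build_llpspm(sites, pseudocount=1.0, background=0.25):
    """One dict per column mapping base -> log2(probability / background)."""
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(len(sites[0])):
        column = [site[i].upper() for site in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return matrix


def score(matrix, seq):
    """Sum of column scores; bases other than A, C, G, T contribute 0."""
    return sum(col.get(base, 0.0) for col, base in zip(matrix, seq.upper()))


def scan(matrix, sequence, cutoff):
    """Return (start, score) for every forward strand window scoring >= cutoff."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window_score = score(matrix, sequence[start:start + width])
        if window_score >= cutoff:
            hits.append((start, window_score))
    return hits


sites = ["ACGT", "ACGT", "ACGA"]  # made-up aligned example sites
matrix = build_llpspm(sites)
cutoff = min(score(matrix, site) for site in sites)
print(scan(matrix, "TTACGTTT", cutoff))
```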
**************************************************************************************
**************************************************************************************
** ScoreParsedBars: Sept 2008 **
**************************************************************************************
For each region finds the underlying scores from the chromosome specific bar files.
Prints the scores as well as their mean. A p-value for each region's score can be
calculated using chromosome, interrogated region, length, # scores, and gc matched
random regions. Be sure to set the -u flag if your scores are log2 values.
-r Full path file text for your region file (tab delimited: chr start stop(inclusive)).
-b Full path directory text for the chromosome specific data xxx.bar files.
-o Bp offset to add to the position coordinates, defaults to 0.
-s Bp offset to add to the stop of each region, defaults to 0.
-u Unlog the bar values, set this flag if your scores are log2 transformed.
-g Estimate a p-value for the score associated with each region. Provide a full path
directory text for chromosome specific gc content boolean arrays. See
ConvertFasta2GCBoolean app. Also provide option -i.
-i If estimating p-values, provide a full path file text containing the interrogated
regions (tab delimited: chr start stop ...) to use in drawing random regions.
-n Number of random region sets, defaults to 1000.
-d Don't print individual scores to screen.
Example: java -jar pathTo/Apps/ScoreParsedBars -b /BarFiles/Oligos/
-r /Res/miRNARegions.bed -o -30 -s -60 -i /Res/interrRegions.bed
-g /Genomes/Hg18/GCBooleans/
**************************************************************************************
**************************************************************************************
** Score Sequences: July 2007 **
**************************************************************************************
SS scores sequences for the presence of transcription factor binding sites. Use the
following options:
-g The full path FASTA formatted file text for the sequence(s) to scan.
-t Full path file text for the FASTA file containing aligned trimmed examples of
transcription factor binding sites. A log likelihood position specific
probability matrix will be generated from these sequences and used to scan the
sequences for hits to the matrix.
-s Score cut off for the matrix. Defaults to zero.
Example: java -Xmx500M -jar pathTo/T2/Apps/ScoreSequences -g /my/affy/DmelSeqs.fasta
-t /my/affy/zeste.fasta
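
The -t and -s options describe building a log likelihood matrix from aligned sites
and scanning with a score cut off. A minimal sketch of that general technique follows.
It is not the app's implementation: the pseudocount, the uniform 0.25 background, the
log base, forward strand only scanning, and the function names are all assumptions.

```python
import math

BASES = "ACGT"


def build_llpspm(sites, pseudocount=1.0, background=0.25):
    """Build a log2 likelihood ratio matrix from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("aligned sites must share one length")
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for i in range(length):
        column = [s[i].upper() for s in sites]
        row = {}
        for b in BASES:
            freq = (column.count(b) + pseudocount) / total
            row[b] = math.log2(freq / background)
        matrix.append(row)
    return matrix


def scan(sequence, matrix, cutoff=0.0):
    """Return (start, window, score) for each window scoring >= cutoff."""
    seq = sequence.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(b not in BASES for b in window):
            continue  # skip windows containing N or other symbols
        score = sum(matrix[i][b] for i, b in enumerate(window))
        if score >= cutoff:
            hits.append((start, window, score))
    return hits
```

For example, scan("TTACGTTT", build_llpspm(["ACGT", "ACGT", "ACGA"])) reports a single
hit for the window ACGT at zero-based position 2.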
**************************************************************************************
**************************************************************************************
** Sgr2Bar: Jan 2012 **
**************************************************************************************
Converts xxx.sgr(.zip or .gz) files to chromosome specific bar files.
-f The full path directory/file text for your xxx.sgr(.zip or .gz) file(s).
-v Genome version (e.g. H_sapiens_Mar_2006, M_musculus_Jul_2007), get from UCSC Browser.
-s Strand, defaults to '.', use '+', or '-'
-t Graphs should be viewed as a stair-step, defaults to bar
Example: java -Xmx1500M -jar pathTo/Apps/Sgr2Bar -f /affy/sgrFiles/ -s + -t
-v D_rerio_Jul_2006
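
The first step of such a conversion, splitting sgr lines (tab delimited: chr position
score) into per chromosome point lists, can be sketched as below. This is only an
illustration: the function name, the skipping of blank and '#' lines, and the sort by
position are assumptions, and writing the binary bar files is not covered.

```python
from collections import defaultdict


def split_sgr_by_chromosome(lines):
    """Group sgr lines (chr, position, score) into per chromosome lists."""
    per_chrom = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: ignore blank and comment lines
        chrom, position, score = line.split("\t")[:3]
        per_chrom[chrom].append((int(position), float(score)))
    for points in per_chrom.values():
        points.sort()  # order each chromosome's points by position
    return dict(per_chrom)
```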
**************************************************************************************
**************************************************************************************
** Simulator: Nov 2008 **
**************************************************************************************
Generates simulated ChIP-seq sequences for aligning to a reference genome.
-f Directory containing xxx.fasta files with genomic sequence. File names should
represent chromosome names (e.g. chr1.fasta, chrY.fasta...)
-r Results directory
-b Bed file containing repeat locations (e.g. RepeatMasker.bed)
-n Number of spike-ins, defaults to 1000
-g Number of random fragments to generate for each spike-in, defaults to 1000
-s Minimum size of a fragment, defaults to 150
-x Maximum size of a fragment, defaults to 350
-l Length of read, defaults to 26
-e Comma delimited text of per base % error rates, defaults to 0.5,0.528,0.556,...
Example: java -Xmx1500M -jar pathTo/USeq/Apps/Simulator -f /Hg18/Fastas -r /Spikes/
-b /Hg18/Repeats/repMsker.bed -l 36
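
How the -s, -x, -l, and -e options might interact can be sketched as follows. This is a
hypothetical illustration, not the Simulator's code: the function names, taking the read
from the start of the forward strand fragment, and choosing substitutions uniformly from
the other three bases are all assumptions.

```python
import random


def sample_read(chrom_seq, min_size, max_size, read_length, rng):
    """Pick a random fragment (sizes as in -s/-x) and return its first bases (-l)."""
    size = rng.randint(min_size, max_size)
    start = rng.randrange(0, len(chrom_seq) - size + 1)
    fragment = chrom_seq[start:start + size]
    return fragment[:read_length]


def apply_read_errors(read, percent_error_rates, rng):
    """Substitute each base with probability given by its per base % error rate (-e)."""
    if len(percent_error_rates) < len(read):
        raise ValueError("need one error rate per read position")
    bases = "ACGT"
    out = []
    for base, pct in zip(read.upper(), percent_error_rates):
        if rng.random() < pct / 100.0:
            # assumption: replace with a different base chosen uniformly
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

Passing a seeded random.Random instance as rng keeps a simulation reproducible. The
error rate list must supply at least one value per read position, so a longer -l needs
a correspondingly longer -e list in this sketch.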
**************************************************************************************