USeq Usage

This is an outline on how to use USeq to process your ChIP/RNA-seq data.

The applications are designed to be user friendly and require only moderate familiarity with command line programs. When in doubt type 'java -jar pathTo/USeq/Apps/ApplicationName' to pull a menu of options. See the readme file for installation instructions.

Benchtop ChIP Recommendations

Use a robust chIP protocol/ kit such as those from NEB or Kappa ...
Do not use any carrier/ blocking DNA (i.e. salmon sperm) in your chIP reaction!
If known positives and negatives are available, perform qPCR to demonstrate enrichment for these regions and optimize your chIP protocol. In the end, your data is only as good as your antibody.
It is absolutely critical that every ChIP-Seq experiment include a positive control, we recommend H3K4Me3. PolII is also good but doesn't consistently work. Running both would be ideal and a requirement for technicians who are starting out with ChIP-Seq experiments. Barring these controls one cannot troubleshoot issues when they arise nor trust the novel findings. Thus a recommended experiment design is 1 H3K4Me3 chIP, 1 PolII chIP, 3 experiment chIP bio replicas, 3 experiment input control bio replicas. Multiplex all 8 samples in 1 lane single end 50bp HiSeq.
If you are using multiple antibodies to chIP various targets from the same source, only submit only one set of bio replica matched input samples. These will be used to control for systematic bias for all of your chIP samples. If you only are interested in where chIP regions change between two conditions or treatments (e.g. a differential chIP-seq experiment), then there is no need to sequence the input sample. Still collect it and keep it in the freezer. It will be needed to produce static chIP-seq maps. IgG controls are not useful.
For higher eukaryotes, >10-20 million uniquely mapped duplicate free reads per chIP or input sample is recommended for a whole genome analysis.

Required Computer Resources for Running USeq Analysis:

Java 1.6+
R with Love et. al. DESeq2 library. If running ScanSeqs or DefinedRegionScanSeqs you'll need to install Storey's Q-Value library.)
These command line applications have been tested on MacOSX and Linux. They have not been tested on a Windows machine. All the source code, with extensive documentation, is included in the USeq package.
A 64 bit computer with > 2-8 Gigabytes RAM.

Basic ChIP-Seq Low Level Data Analysis

Two types of chIP-seq analysis are supported by USeq, static and dynamic. In a static analysis, a chIP sample is compared to an input sample to map your factor of interest's occupancy across the genome under one condition. In a dynamic analysis, one chIP sample is directly compared to another chIP sample. Here one is asking where factor binding has changed. It is a much more biologically meaningful analysis and identifies likely sites of activity rather than simple occupancy. The samples are also better matched and the statistics more accurate.

Align your sequences to a reference genome using your aligner of choice. We recommend using the Novoaligner because of its posterior probability statistics, gapped alignments, and adapter trimming. Include a fasta file containing a fake chromosome called 'chrAdapter' with adapter sequences in all orientations. Reads aligning to the chrAdapter will be removed.
(Optional but recommended) If you are using a non primate sample, align your sequences to human using Bowtie to estimate the level of contaminating human DNA. This should be < 4% of high quality reads.
Run the ChIPSeq application. It parses raw alignment files, removes duplicate reads with the PointDataManipulator, generates ReadCoverage tracks, estimates the peak shift using the PeakShiftFinder, and lastly, runs the MultipleReplicaScanSeqs and the EnrichedRegionMaker to identify ChIP-Seq peaks.

If you need to customize your ChIP-Seq analysis then run the following manually after completing the prior:

Remove duplicate alignments by selecting the best for each start position using the FilterDuplicateAlignments application. A good dataset will contain > 90% unique reads. One with < 50% unique reads likely went through a PCR bottleneck due to insufficient properly sized chIP DNA. In such cases, it is typically better to remake the library using a scaled up chIP prep than to perform additional sequencing. No, a whole genome amplification prior to making your sequencing library is not recommended.
Convert your alignments into binary PointData using the SamParser, ElandParser, NovoalignParser, or the Tag2Point (bed files) application. PointData is used by USeq for subsequent manipulation.
Remove reads that intersect known false positive regions (e.g. satellite repeats) with the FilterPointData application. Satellite repeats can be parsed from UCSC's RepeatMasker tracks.
For visualization purposes, build ReadCoverage hit tracks. These graphs plot the number of reads per million mapped reads that cover every base and are good to examine when evaluating likely binding peaks.
(Optional) If you have replica data, compare the replicas using QCSeqs to calculate a correlation coefficient. These correlation coefficients tend to be on the low side (0.5-0.75 r^2) compared to what is seen with microarrays where probe effect often drives the r^2 values.
Determine the average peak shift, the distance between the plus and minus strand peaks for your chIP-seq data using the PeakShiftFinder application. Remember, you are sequencing the ends of fragments that are centered about a binding region. Thus a bimodal distribution of mapped reads must be corrected by shifting their locations 3' by 1/2 the fragment length. The PeakShiftFinder application works best with full 10 million read datasets with > 50 strong positives. If you are running a preliminary/ low read dataset or the PeakShiftFinder fails, use a peak shift of 75bp and a window size of 200bp in ScanSeqs. Note, given the small insert size sequencing bias, estimating your peak shift based on the size selected DNA is difficult. If you size selected your DNA at 200-400bp, the majority of your library will actually be 200bp. This varies by experiment and is best estimated empirically from the data.
Run MultipleReplicaScanSeqs to window scan your data and calculate a negative binomial p-value using Anders and Huber's DESeq statistics that controls for overdispersion in your data. The Benjamini and Hochberg p-value -> FDR conversion is applied to control for multiple testing. Both FDR and normalized log2 ratio window/ track graph files are created as well as a binary window data file (xxx.swi) for subsequent use by the EnrichedRegionMaker. With some no replica datasets, if the MultipleReplicaScanSeqs fails to return any enriched/ reduced windows try using the old ScanSeqs with it's binomial p-value statistics.
Merge overlapping windows into enriched (or reduced) regions and print several reports using the EnrichedRegionMaker. A good start is to use two thresholds to produce a list of regions with a 1% FDR and a minimal 2x enrichment/ reduction. Use a cutoff of 20 (a -10Log10(0.01) conversion) for the FDR estimation, and 1 for the Log2(Ratio)). If no regions are produced, either relax the thresholds or ask the application to generate the top 10, 100, and 1000 regions using the FDR index.

Some High Level ChIP-Seq Analysis Options

Find the nearest neighboring genes to a list of binding peaks using FindNeighboringGenes
Intersect your list of binding peaks to another with IntersectRegions
Perform a directed ScanSeqs using a gene table or list of regions, see MultipleReplicaDefinedRegionScanSeqs
Fetch the sequence under each of your binding peaks using FetchGenomicSequences
Score sequences and chromosomes for binding sites with ScoreSequences and ScoreChromosomes
Make primers for qPCR validation using the Primer3Wrapper. Yes, you do have to validate your results.

Basic RNA-Seq Low Level Data Analysis

RNA-Seq analysis is rather complex. The USeq apps are best suited for comparing two different conditions to identify changes in gene expression, gene structure/ splicing, and novel transfrags. See the Annotations directory to see if an appropriate merged non overlapping gene table is available for your genome build. Test datasets are also available.

(NOTE, these methods are DEPRECIATED see RNA-Seq Analysis with Extended Splice Junctions and BAM files.)

Align your sequences to a reference genome with splices using your aligner of choice. We recommend using the Novoaligner because of its posterior probability statistics, gapped alignments, and adapter trimming. Include a fasta file containing a fake chromosome called 'chrAdapter' with adapter sequences in all orientations. Reads aligning to the chrAdapter will be removed. Include a second fasta file containing splice junction sequences generated by running the MakeSpliceJunctionFasta application on a UCSC refflat formatted gene table (tab delimited: name1, name2(optional), chrom, strand, transcriptStart, transcriptEnd, codingSeqStart, codingSeqEnd, exonCount, exonStarts(comma delimited), exonEnds), interbase coordinates. A compilation of UCSC's Known Genes and Ensembl's transcripts are recommended. Duplicates are ignored so pooling of many gene models is recommended. The radius of the sequence extracted from the splice junctions should be matched to the length of the sequencing. Use 34bp radius for 36bp RNA-Seq, 72bp for 76bp RNA-Seq, etc.
(Optional but recommended) If you are using a non primate sample, align your sequences to human using Bowtie to estimate the level of contaminating human RNA. This should be < 4% of high quality reads.
Run the RNASeq application. It parses raw alignment files, generates ReadCoverage tracks, runs ScanSeqs and the EnrichedRegionMaker to identify novel transfrags, and identifies differential gene expression and splicing using the MultipleReplicaDefinedRegionScanSeqs.

If you need to customize your RNA-Seq analysis then run the following manually after completing the prior:

Convert your alignments into binary PointData using the SamParser, ElandParser, NovoalignParser, or the Tag2Point (bed files) application. PointData is used by USeq for subsequent manipulation.
For visualization purposes, build ReadCoverage hit tracks. These graphs plot the number of reads per million mapped reads that cover every base and are good to examine when evaluating likely transfrags.
(Optional) If you have replica data, compare the replicas using QCSeqs to calculate a correlation coefficient. These correlation coefficients tend to be on the low side (0.5-0.75 r^2) compared to what is seen with microarrays where probe effect often drives the r^2 values.
Run the MultipleReplicaDefinedRegionScanSeqs to score genes for differential expression and differential splicing. For each gene in a UCSC refflat formatted table of genes, MRDRSS compares the number of exonic reads in the treatment to the number in the control. The first comparison is to calculate a negative binomial p-value estimating the probability that the gene is differentially expressed using Anders' DESeq package. These p-values are converted to FDRs using the Benjamini-Hochberg method. As with all p-values and FDRs in USeq, they are represented as -10Log10(p-val/FDR) transformed values where 13 = 5%, 20 = 1%, etc. 1% of genes with an FDRs = 20 will be false and not enriched. The accuracy of this FDR estimation has been tested with a simulated overdispersed RNA-Seq data and is quite good, within 2x of the real FDR. A second statistic that is useful in identifying differentially expressed genes is to sort genes based on their log2((sumT+1)/(sumC+1)) ratio with > 1 or < -1 a good starting point after selecting genes that pass a given FDR. Here one is asking for genes that are not only statistically different but also by at least 2 fold. To score genes for differential splicing, a bonferroni corrected chi-square test of independence is employed to look for differences in the distribution of exonic reads between the treatment and control datasets. The p-values should be treated as a ranking statistic since many of the exons contain < 4 reads.
Use the DefinedRegionScanSeqs if you only have a single sample for analysis or if the MultipleReplicaDefinedRegionScanSeqs failed to return any differentially expressed regions. This is generally not recommended since it doesn't control for overdispersion in the count data.
Run MultipleReplicaScanSeqs to window scan your data and generate FDR and log2Ratio window/ track graph data. Set the peak shift to 0 and the window size to 100, also indicate that you want to score reduced regions. MultipleReplicaScanSeqs also produces a binary window data file (xxx.swi) that is used by the EnrichedRegionMaker to produce lists of putative enriched and reduced regions.
Merge overlapping windows into enriched (and in a subsequent run, reduced) regions and print several reports using the EnrichedRegionMaker. A good start is to use two thresholds to produce a list of regions with a 1% FDR and a minimal 2x enrichment/ reduction. Use a cutoff of 20 (a -10Log10(0.01) conversion) for the FDR and 1 for the Log2(Ratio)). If no regions are produced, either relax the thresholds or ask the application to generate the top 10, 100, and 1000 regions using the binomial p-value score. To identify putative novel transfrags, include a file of all know exons (use the ExportExons application with your UCSC gene tables) while running the EnrichedRegionMaker to excluded intersecting windows with different bp buffers.

High Level RNA-Seq Data Analysis Options

One clear task with dynamic RNA-Seq analysis is to generate a list of differentially expressed genes. This can be accomplished using the data contained in the MultipleReplicaDefinedRegionScanSeqs Excel spreadsheet. To identify differentially expressed genes, sort those with both an FDR >= 20 and an absolute log2(ratio) >= 1. These are good starting thresholds. With these lists you can:
1. Look for over representation of GO terms and Kegg pathways using web applications such as GoMiner and DAVID or the KeggPathwayEnrichment application
2. Compare gene lists from multiple experiments or platforms using the IntersectLists application to look for correlated or anti-correlated intersection.
3. Use the CorrelationMaps application to look for physical gene clusters (aka gene expression neighborhoods, chromosome territories).
A second task is to identify genes with changes in gene structure. Take the MultipleReplicaDefinedRegionScanSeqs Excel spreadsheet and sort on the pValDiffDist. These scores represent -10Log10(bonferroni corrected pvalues) from a chi-square test of independence. Treat them as a ranking score and visually examine each gene in IGB. The gene name in the spreadsheet is hot linked to IGB so clicking it will take you to that location in IGB. I like to then use GenoPub to serve DAS2 distributed data to IGB so, when properly configured, clicking the refresh button will load the read coverage, splice junction, and, if available, the paired read track data for that particular gene. A number of false positives come up with this analysis, most are due to contamination of the genes annotation with reads derived from an adjacent annotation. A stranded analysis will resolve some of these issues but not all. Lastly, intron retention is a common theme in some species, use the Excel spreadsheet generated from running the MultipleReplicaDefinedRegionScanSeqs with a table of introns instead of exons.
A third task is to identify novel transfrags. Use the EnrichedRegionMaker and the xxx.swi file from ScanSeqs to identify transfrags that exceed the QValFDR and log2Ratio thresholds. Those that intersect known annotation should be removed by providing a file of known exons to the EnrichedRegionMaker. You can specify a bp buffer to expand the size of each exon and excluded more intersecting windows. This is helpful since the current gene annotations are incomplete and a variety of transcripts are found overlapping the 3' and 5' UTRs. It is a good idea to increase the max gap parameter in the EnrichedRegionMaker to join adjacent transfrags.
Lastly, make primers for qPCR validation using the Primer3Wrapper. Yes, you do have to validate your results.

Bisulfite Sequencing Analysis

See the Bis-Seq user guide.

How to design a microarray for DNA capture and sequencing (Cap-Seq)

First determine the general regions you wish to capture. This could be particular genes, gene regions, whole chromosomes. Pay attention to the genome build of the coordinates and use this throughout.
If these general regions contain duplicates or overlapping regions collapse them using the MergeRegions application.
Create a bed file containing all exons. We prefer to combine UCSC Genes with Ensembl Genes, both are available from UCSC's table browser. Save as a tab delimited text file containing (name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds). Use the ExportExons application to extract the exons +/- 50bp.
Create another bed file with other regions of interest, some folks like to tile 1.5kb of the promoters.
Merge the two bed files using the MergeRegions application.
OK, now you have two bed files, one with your regions of interest, the other containing all exons +/-50 and 1.5kb promoters. Now, use the ParseBedFilesForParticularRegions to extract the exons and promoter regions that fall within your regions of interest.
(Optional) Repeat mask the parsed exons and promoters using the SubtractRegions application. The RepeatMasker file can be downloaded from UCSC's table browser. Save as a simple coordinate bed file (chr start stop). It helps to run the MergeRegions app on the RepeatMasker file first to collapse the number of repeats.
Run the OligoTiler on the repeat masked extracted exons and promoters. Vary the spacing between oligos, minimum 20, maximum 40, to fill your array. Agilent provides many options, the number of available oligos is as follows (1x1M = 967331, 2x400K = 415162, 2x105K = 103507, 4x44K = 43020). Unless you have a very good reason, use the alternating strand oligo tilling designs. No reason to throw out 1/2 your sample!
By default, OligoTiler generates two files containing alternating forward and reverse strand oligo sequences for each tiled region. Use a randomized version of the second file (see RandomizeTextFile) to bring the number of oligos up in the first file to fill out the space on your array.
If you did not repeat mask your parsed regions, align your tiled oligos to your target genome and toss those with multiple alignments (exact match to 2bp mismatch). It's a good idea to intersect (see FilterIntersectRegions) this alignment masked design with the repeat masked design and add the difference to the repeat masked design for complete coverage.
Check your design by uploading all of the bed files into IGB. This is the most critical step, spend 1-2hrs validating every region.
Now brace yourself, the most difficult part of the process, upload your design to Agilent using their eArray website. NimbleGen also provides custom services and their arrays are equally good for Cap-Seq. Your on target capture rates should be ~70-80%. Agilent's 1M custom CGH array has room for 967,331 user defined probes after tossing the optional control probe sets.

Be sure to check out the full listing of USeq applications. They might save you days of coding.

Questions? Comments? Post messages to the USeq Users Email List or, if needed, contact David Nix directly in the Bioinformatics Core.

Join the USeq Users mailing list to stay up to date with the latest USeq developments, tips, and tricks for next generation genomic analysis.

Many thanks to Ken Boucher in the Biostatistics Shared Resource for help with the statistical methods.