Bisulfite Sequencing Analysis with USeq:
Alignments and PointData Parsing:
- Find a big Linux or Mac OS server with > 24GB RAM (ideally 64GB) and > 500GB of fast disk. These applications are memory intensive. Amazon EC2 rents such computers for < $2.25/ hr.
- Align your bisulfite sequence fastq data using Novocrafts novoalign in bisulfite mode. Pilot whether the '-b 2' or '-b 4 -u 12' option produces the best alignments. For most Illumina libraries the '-b 2' option is OK. For amplicon sequencing we've found the '-b 4 -u 12' critical. The following novoalign parameters are a good starting point for 101bp paired end reads are ' -oSAM -rRandom -t240 -h120 -b2 '. The novoindex should contain a chromosome called chrLambda that contains the demethylated lambda sequence that was spiked into your genomic DNA prior to conversion. This is a critical control for measuring the combine conversion and sequencing error rate.
- Coordinate sort the sam alignments with Picard's SortSam.jar and remove duplicates with Picard's MarkDuplicates.jar for shotgun bisulfite datasets. Output the duplicate reduced results as a bam file with index for viewing in a genome browser. Delete the raw sam files.
- For amplicon datasets, reduce extreme read depth coverage issues using the SamReadDepthSubSampler app with a target of 100.
- Run the USeq NovoalignBisulfiteParser to parse the alignments into four binary PointData sets containing the number of observed converted Cs (Ts - non methylCs) and non-converted Cs (methylCs) at each reference C sequenced in the genome for both the plus and minus strands. Note, the xxx.bar file output can be visualized directly in the Integrated Genome Browser (IGB). Better yet, convert your xxx.bar files to xxx.useq using the Bar2USeq application.
Single Methylome Analysis:
- Run the BisStat application on the parsed PointData from NovoalignBisulfiteParser to generate aggregate methylation statistics and fraction methylation tracks. For amplicon datasets, increase the min read depth threshold to 30. BisStat scores all xxCxx contexts for methylation using two methods. The first is simply a tabulation of the number of converted Cs and non-convertedCs found in a given context in each alignment. No thresholds are set except those used when parsing the text novoalignments. The second method looks at each reference C in a particular context and uses a binomial p-value to test whether that C is methylated. The expect is set to the observed non-converted C rate from the fully un methylated lambda sequence alignments. This expect represents a combination of the fraction of non conversion and sequencing error. P-values were converted to FDRs using the Benjamini & Hochberg (REF) method to control for multiple testing. Each reference C displaying an FDR of 0.01 or less is considered methylated. This latter method is sensitive to the read depth in a given dataset and can only be used when comparing genomes sequenced to a similar depth. The tabulation method is rather insensitive to read depth but assumes a consistent sampling of all xxCxx contexts across the genome between different methylomes.
- BisStat generates several summary graph files in bar format for visualizing the per base fraction methylation and associated FDR for Cs that meet a minimum read coverage, defaults to 8. It also calculates a smoothed window fraction methylation by taking the median of the per base fraction methylation scores.
- To identify regions with a particular fraction methylation, use the BisStatRegionMaker. To examine particular methylation contexts (e.g. mCG, mCHG, mCHH), first parse the methylome PointData with the ParsePointDataContexts application.
- To examine particular regions for their methylation state use the ScoreMethylatedRegions application. It can use randomly selected regions matched for chromosome, region length, number of observations, and GC content to calculate a p-value for differential methylation and fold enrichment over background. Alternatively, one can filter the methylome PointData for particular regions with the FilterPointData application and then pass this filtered data to BisStat to generate the context summary statistics.
- Class average plots of methylation over particular genomic annotations can be generated using the BisSeqAggregatePlotter. This application takes a list of regions (e.g. transcription start sites +/- 5kb) and collects, inverts negative strand region data, scales, and sums the methylome data into a single aggregate region. A per base fraction methylation is calculated for bases with eight or more observations in the aggregate. Lastly, several median smoothed fraction methylation scores are calculated using sliding windows of 5, 10, and 25% max region size. These can then be plotted using Excel or R.
Comparative Methylome Analysis:
- Two applications have been developed to identify regions of differential methylation between methylomes. The first, BisSeq, takes parsed methylome PointData from two experiments and uses a sliding window to score each window for differential methylation using a fisher exact or chi-square test. The p-values are controlled for multiple testing by converting to FDRs using the Benjamini and Hochberg (REF) method. A good start is to take overlapping 1.5kb windows that meet or exceeded an FDR of 30 (0.001) and also show a log2Ratio of 2 (4x fold difference). A spread sheet output of differentially methylated regions is generated as well as the window level FDR and log2 ratio (fraction methylation treatment/ fraction methylation control) data as bar file graphs for visualization in IGB.
- The second application, DefinedRegionBisSeq, functions analogously to BisSeq but scores user defined regions (e.g. gene promoters, CpG islands) for differential methylation.