Extended Splice Junction RNA-Seq Analysis with USeq


To create a novoindex with extended splice junctions:

  1. Create a multi-fasta file containing extended splice junctions matched to your target read length minus 4bp using the USeq MakeTranscriptome app.
    1. Download from UCSC an Ensembl transcript table, use the "output file: selected fields from primary and related tables" to select: name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds name2.
    2. Optional but recommended, toss annotations to non standard chromosomes (e.g. chrUn_XXX, chrXX_ctg5_XXX, chrXX_g100XXX). These contain unassembled and different variations of the reference chromosomes. They add duplicate identical sequence too so that reads will sometimes match to both the variant and the reference and are scored as non unique hits.
    3. Use the USeq PrintSelectColumns app to move the name2 column before name. This places the ensembl gene name first and the transcript name second. Note, Excel won't work here. (e.g. java -jar -Xmx2G ~/AppsUSeq/PrintSelectColumns -i 10,0,1,2,3,4,5,6,7,8,9 -f mm9EnsTrans.ucsc) . Either remove the header line or place a # at it's start.
    4. Run the USeq MakeTranscriptome app to generate extended splice junctions. The splice junction radius should be set to your read length minus 4. This is a time and memory intensive. If needed, split by chromosome and run on a cluster. (e.g. java -jar -Xmx22G ~/USeqApps/MakeTranscriptome -f /Genomes/Mm9/Fastas -u mm9EnsTrans_Corr.ucsc -r 46)
  2. Place the splice junction fasta file in the fasta genome directory and run novoindex. It is a good idea to include a chromosome with adapter sequence combinations as well as a chromosome with phiX sequence. It is also recommended to exclude non standard chromosomes that could potentially provided duplicate identical sequence causing unique matches to look non unique. (e.g. ~/BioApps/novocraft/novoindex mm9EnsTransRad46Num100kMin10SplicesChrPhiXAdaptr.novoindex Fastas/* )
  3. Share your novoindex and splice junction files by uploading your data into an Analysis Report in GNomEx! Novoindexes, gene, and transcript files are available for many species and genome builds, see UBioCore Transcriptomes.


To align and process your Illumina fastq RNA-Seq data:

  1. Align your fastq data using novoalign and the read length matched extended splice junction novoindex. Output the reads in SAM format and allow for 50 repeat matches for each read. Limit the maximum alignment quality to 120 ~ 4 mismatches per alignment. Use grep to toss @SQ lines and those reads that don't align. Also, be sure to test your gzip compressed fastq files. Novoalign doesn't throw an error when encountering a broken gzip file. (e.g. gunzip -t *gz && ~/novocraft/novoalign -o SAM -r All 50 -t 120 -d /scratch/local/mm9EnsTransRad46Num100kMin10SplicesChrPhiXAdaptr.novoindex -f /scratch/local/7410X6_s_5_1_sequence.txt.gz /scratch/local/7410X6_s_5_2_sequence.txt.gz | grep -v ^@SQ | grep chr | gzip > 7410X6_s_5_raw.sam.gz )
Note, these alignments are not ready for use. The splice-junction coordinates need to be converted to genomic coordinates by running the SamTranscriptomeParser or the RNASeq app. (e.g. java -jar -Xmx4G ~/AppsUSeq/SamTranscriptomeParser -a 90 -f 7410X6_s_5_raw.sam.gz Sam headers are created automatically.)


To analyze your processed RNA-Seq data:

The easiest option is to run the RNASeq application on your raw sam alignment files. This executes many USeq applications and is a good starting point.

Custom Analysis:
  1. Convert splice-junction coordinates to genomic coordinates by running the SamTranscriptomeParser (e.g. java -jar -Xmx4G ~/AppsUSeq/SamTranscriptomeParser -a 90 -f 7410X6_s_5_raw.sam.gz Sam headers are created automatically.)
  2. Generate relative read/ depth coverage graphs using the USeq Sam2USeq apps. These can be visualized in IGB (and UCSC Genome Browser if you upload your data to GenoPub) to show the sequence depth over every base.
  3. Run the DefinedRegionDifferentialSeq to score known annotation for differential expression using Loves et. al. DESeq2 package and your processed BAM files.
    1. First create a gene table containing collapsed transcripts using the modified transcript table downloaded from UCSC, see above. This should contain the following columns: geneName transcriptName chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds. Use the USeq MergeUCSCGeneTable app to merge any transcripts that share the same geneName. Exons are maximized, introns minimized. (e.g. java -jar -Xmx4G ~/AppsUSeq/MergeUCSCGeneTable -u mm9EnsTrans_Corr.ucsc )
    2. Install R and the DESeq2 package, see http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
    3. Run the DefinedRegionDifferentialSeq app to score genes (or other defined regions) for differential expression and alternative splicing (e.g. java -Xmx4G -jar ~/AppsUSeq/DefinedRegionDifferentialSeq -c /Data/PolIIConditions -s PolIIResults/ -u /Anno/mm9EnsGenes.ucsc.gz -g H_sapiens_Feb_2009)
  4. Upload your analysis into an Analysis Report in GNomEx and post your data tracks!