USeq Compressed Binary Data Format 1.0

General Description:

A useq archive is a zip-compressed "directory" containing genomic data split by chromosome, strand, and slices of observations. Its purpose is to provide a means of storing and distributing massive genomic datasets using compressed, pre-indexed, minimal binary data formats. The Java code for reading and writing useq data is distributed through the GenoViz and USeq SourceForge projects. The Integrated Genome Browser and GenoPub (a DAS/2 data distribution web app) support useq archives.
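
Because a useq archive is an ordinary zip file, any zip-aware library can open it and randomly access individual data slices. As a minimal sketch using only the standard java.util.zip package, the following lists the entries of an archive; the file name "example.useq" is just a placeholder.

    import java.io.IOException;
    import java.util.Enumeration;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Lists the contents of a useq archive without decompressing the whole file.
    public class ListUSeqSlices {
        public static void main(String[] args) throws IOException {
            try (ZipFile archive = new ZipFile("example.useq")) {
                Enumeration<? extends ZipEntry> entries = archive.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    // Each entry is either the archiveReadMe.xxx or one data slice
                    System.out.println(entry.getName() + "\t" + entry.getCompressedSize() + " bytes compressed");
                }
            }
        }
    }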

The data types currently supported within an archive are:

These cover most of the commonly used genomic data file types (e.g. xxx.bed, xxx.gff, xxx.sgr, xxx.gr, xxx.wig). Use the Text2USeq or Wig2USeq applications to convert text genomic data formats into USeq archives. Likewise, use the USeq2Text application to convert USeq archives into 6-column text bed files (chrom start stop text score strand).

General Guidelines:

Format archiveReadMe.xxx

The archiveReadMe.xxx contains three required key=value pairs as well as additional information related to the entire dataset. The archiveReadMe.txt version consists of comment lines beginning with '#', which are not parsed, and key=value pairs delimited by a return, thus one per line. The first '=' sign in each key=value pair is used to split the tokens. Keys must not contain '=' signs or white space. White space before and after the '=' is permitted. When possible, use the reserved key names in the ArchiveInfo.java file and add new ones as needed. At some point an archiveReadMe.xml version with a DTD should be created, volunteers?
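
As an illustration only (not the project's ArchiveInfo.java implementation; the class and method names below are made up), a readme following these rules can be parsed with nothing more than the standard library:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal archiveReadMe.txt parser: skip '#' comment lines, split each
    // remaining line on its first '=' sign, and trim surrounding white space.
    public class ReadMeParser {
        public static Map<String, String> parse(String path) throws IOException {
            Map<String, String> keyValues = new LinkedHashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(path))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#")) continue; // comments are not parsed
                    int firstEquals = line.indexOf('=');
                    if (firstEquals < 1) continue; // skip malformed lines with no key
                    String key = line.substring(0, firstEquals).trim();
                    String value = line.substring(firstEquals + 1).trim();
                    keyValues.put(key, value);
                }
            }
            return keyValues;
        }
    }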

There currently are three required reserved keys:

Optional reserved keys:

Inner workings of a data slice serialization

See one of the USeqData files (e.g. RegionScoreTextData.java) for a code example illustrating the following. USeq archives make use of start position offsets and region lengths in combination with zip compression to significantly reduce the size of the data. The advantages of using a zip archive are numerous and include data compression, random file access to each data slice, an extractable text readme file, cross-platform (Windows, Mac, Linux...) and cross-language (Java, C++, Python, Perl) support, and manual manipulation of the archive after creation.
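
To make the offset idea concrete, here is a rough sketch, with made-up coordinates and an invented slice entry name, of writing sorted regions as start offsets plus lengths into a single zipped entry. The actual field order, numeric widths, and slice naming are defined by the USeqData classes (e.g. RegionScoreTextData.java), so treat this only as an illustration of the technique.

    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    // Writes each start as an offset from the prior start and each stop as a
    // length, keeping the serialized numbers small before zip compression.
    public class RegionSliceSketch {
        public static void main(String[] args) throws IOException {
            int[] starts = {1000, 1050, 1200, 5000}; // sorted absolute starts
            int[] stops  = {1100, 1150, 1300, 5100};
            try (ZipOutputStream zip = new ZipOutputStream(new FileOutputStream("sketch.useq"))) {
                zip.putNextEntry(new ZipEntry("chr1+0-5100-Region-4")); // hypothetical slice name
                DataOutputStream out = new DataOutputStream(zip);
                int priorStart = 0;
                for (int i = 0; i < starts.length; i++) {
                    out.writeInt(starts[i] - priorStart); // start position offset
                    out.writeInt(stops[i] - starts[i]);   // region length
                    priorStart = starts[i];
                }
                out.flush();
                zip.closeEntry();
            }
        }
    }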

For each data slice: