USeq Compressed Binary Data Format 1.0
General Description:
A useq archive is a zip compressed "directory" containing genomic data split by chromosome, strand, and slices of observations.
Its purpose is to provide a means to store and distribute massive genomic datasets using compressed preindexed minimal binary data formats.
The Java code for reading and writing useq data is distributed through the GenoViz and USeq SourceForge projects.
The Integrated Genome Browser and GenoPub (a DAS/2 data distribution web app) support useq archives.
The data types currently supported within an archive are:
- Position
- Position, score
- Position, text
- Position, score, text
- Start, stop
- Start, stop, score
- Start, stop, text
- Start, stop, score, text
These cover most of the commonly used genomic data file types (e.g. xxx.bed, xxx.gff, xxx.sgr, xxx.gr, xxx.wig).
Use the Text2USeq or Wig2USeq applications to convert text genomic data formats into USeq archives.
Likewise, use the USeq2Text application to convert USeq archives into 6-column text bed files (chrom start stop text score strand).
General Guidelines:
- USeq archives are designated by the xxx.useq extension
- Interbase coordinates (first base is zero, last base in a range (start stop) is excluded, length = stop - start)
- Coordinates are relative to the sense + genomic strand
- Start <= stop (orientation is designated by the strand)
- Strand is either +, -, or .
- Each data slice within an archive contains the same data type
- The data within a slice are sorted first by start position and then, when appropriate, by length, shortest to longest
- Each data slice follows a particular naming convention (e.g. chrX+43455645-43456645-1000.isft, chrY_random.22345-23678-100.i), no spaces:
  - chromosome (e.g. chr5, chrX_random, chr4_ctg9_hap1)
  - strand (+, -, or .)
  - first start bp position
  - last start bp position + 1 (the +1 is used to follow the interbase range coordinates specification, starts are included, stops excluded)
  - number of observations
  - data type (a combination of the following single letters)
    - signed 16-bit short = s
    - signed 32-bit integer = i
    - signed single-precision 32-bit IEEE 754 float = f
    - UTF-8 text = t
- Data slices within an archive are in no particular order
- The first "file" within a useq archive is always a text 'archiveReadMe.xxx'.
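The slice naming convention above can be checked mechanically. The following is a minimal sketch, not the USeq Java implementation: the names SliceName and parse_slice_name are hypothetical, and the regular expression assumes that the chromosome is everything before the last strand character that is followed by the three dash-separated numbers.

```python
import re
from typing import NamedTuple


class SliceName(NamedTuple):
    """Fields of a data slice name (hypothetical helper type)."""
    chromosome: str
    strand: str
    first_start: int
    last_start_plus_one: int
    observations: int
    data_type: str


# chromosome, strand, first start, last start + 1, observation count, data type
_SLICE_NAME = re.compile(
    r"^(?P<chrom>.+)(?P<strand>[+\-.])"
    r"(?P<first>\d+)-(?P<last>\d+)-(?P<count>\d+)"
    r"\.(?P<dtype>[isft]+)$"
)


def parse_slice_name(name: str) -> SliceName:
    """Split a slice name such as chrX+43455645-43456645-1000.isft."""
    match = _SLICE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognizable slice name: {name!r}")
    return SliceName(
        chromosome=match.group("chrom"),
        strand=match.group("strand"),
        first_start=int(match.group("first")),
        last_start_plus_one=int(match.group("last")),
        observations=int(match.group("count")),
        data_type=match.group("dtype"),
    )
```

Because the coordinates are interbase, the last start position in a slice is last_start_plus_one - 1.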
Format archiveReadMe.xxx
The archiveReadMe.xxx contains three required key=values as well as additional information related to the entire dataset.
The format of the archiveReadMe.txt version is simply comment lines beginning with '#' that are not parsed and key=values delimited by a return, thus one per line.
The first '=' sign in each key = value is used to split the tokens. Keys must not contain '=' signs or white space. White space before and after the '=' is permitted.
When possible, use the reserved key names in the ArchiveInfo.java file and add new ones as needed. At some point an archiveReadMe.xml version with a DTD should be created; volunteers?
There currently are three required reserved keys:
- useqArchiveVersion = 1.0 (only 1.0 at present)
- versionedGenome = H_sapiens_Mar_2006 (the Affymetrix form (species, three letter build month, and year) is preferred for reference genomes)
- dataType = graph, region, sequence, or other (a hint for how to render the data)
Optional reserved keys:
- description =
- originatingDataSource =
- archiveCreationDate =
- units = of the score value
- initialGraphStyle = Bar, Dot, Line, Min_Max_Ave, Stairstep, or HeatMap
- initialColor = hex color value (e.g. #B2B300) for observations
- initialBackground = hex color value for track background
- initialMinY = float for setting the minimum score value
- initialMaxY = float for setting the maximum score value
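The archiveReadMe.txt rules described above (comment lines, one key=value per line, split on the first '=') can be sketched as a small parser. This is an illustration only, not the ArchiveInfo.java logic: the function name parse_archive_readme is hypothetical, and skipping blank lines and lines without an '=' is an assumption that the text above does not state.

```python
REQUIRED_KEYS = ("useqArchiveVersion", "versionedGenome", "dataType")


def parse_archive_readme(text: str) -> dict:
    """Parse archiveReadMe.txt content into a dict of key -> value."""
    entries = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Comment lines begin with '#' and are not parsed.
        if not line or line.startswith("#"):
            continue
        # Only the first '=' splits the tokens; later ones belong to the value.
        key, separator, value = line.partition("=")
        if not separator:
            continue  # assumption: lines without '=' are ignored
        key = key.strip()
        if not key or any(ch.isspace() for ch in key):
            raise ValueError(f"invalid key in line: {raw_line!r}")
        entries[key] = value.strip()
    missing = [key for key in REQUIRED_KEYS if key not in entries]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return entries
```

For example, a line such as `initialColor = #B2B300` is kept as a key and value, because only lines that begin with '#' are treated as comments.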
Inner workings of a data slice serialization
See one of the USeqData files (e.g. RegionScoreTextData.java) for a code example illustrating the following.
USeq archives make use of start position offsets and region lengths, in combination with zip compression, to significantly reduce the size of the data.
The advantages of using a zip archive are numerous and include data compression, random file access to each data slice, an extractable text readme file,
cross-platform (Windows, Mac, Linux...) and cross-language (Java, C++, Python, Perl) support, and manual manipulation of the archive after creation.
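Because a useq archive is an ordinary zip file, any zip library can list its entries and pull out the readme without touching the data slices. The sketch below uses Python's standard zipfile module. The helper names and the demo archive contents are made up for illustration, and the slice payload is a placeholder rather than real serialized data.

```python
import io
import zipfile


def build_demo_archive() -> bytes:
    """Build a tiny in-memory zip laid out like a useq archive (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The readme is always the first entry.
        archive.writestr(
            "archiveReadMe.txt",
            "useqArchiveVersion = 1.0\n"
            "versionedGenome = H_sapiens_Mar_2006\n"
            "dataType = graph\n",
        )
        # Placeholder payload; a real slice holds the binary observations.
        archive.writestr("chrX+100-201-2.i", b"placeholder")
    return buffer.getvalue()


def read_readme_and_slice_names(archive_bytes: bytes):
    """Return the readme text and the names of the remaining entries."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        names = archive.namelist()  # entries in the order they were written
        readme_text = archive.read(names[0]).decode("utf-8")
        return readme_text, names[1:]
```

A single slice can then be fetched by name with ZipFile.read or ZipFile.open, which is what makes per-slice random access possible.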
For each data slice:
- The data slice is first scanned to see if shorts can be used for the start position offsets. If every gap between consecutive start positions is less than 65536, shorts are used. Since Java has only signed types, the range of each short is extended by subtracting 32768 from the value.
- Likewise, for start stop data, the lengths are scanned to see if shorts can be used in place of ints for the data slice.
- After writing the archiveReadMe.xxx to a zip stream, the data slices are written using this form:
  - A zip entry is begun using the chrX+43455645-43456645-1000.isft naming convention
  - A text/string UTF-8 value is written. Currently this is not used and defaults to "".
  - The first observation is written. This includes:
    - an int representing the real genomic bp position/ start
    - if start stop data, the length of the region (stop-start) is written as either an int or a short
    - lastly, any other data such as score and text are written for the first observation
  - The subsequent observations are written.
    - an int or short representing the offset from the prior observation (remember the data are sorted by start position)
    - if start stop data, the length of the region (stop-start) is written as either an int or a short
    - lastly, any other data such as score and text
  - The zip entry is closed.
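The steps above can be sketched for start, stop, score data as follows. This is a rough Python illustration, not a port of the USeqData Java classes. It assumes big-endian byte order and a two-byte length prefix for the header text, as Java's DataOutputStream would produce. It also assumes a short is stored as (value - 32768), and all function names are hypothetical.

```python
import struct


def fits_in_short(values) -> bool:
    """True if every value fits the extended short range 0..65535."""
    return all(0 <= value < 65536 for value in values)


def pack_number(value: int, use_short: bool) -> bytes:
    """Pack as a shifted signed short or as a signed 32-bit int (big-endian)."""
    if use_short:
        # Assumed convention: shift 0..65535 into -32768..32767.
        return struct.pack(">h", value - 32768)
    return struct.pack(">i", value)


def encode_region_score_slice(regions):
    """Encode (start, stop, score) tuples into the bytes of one slice entry.

    Returns (payload, short_offsets, short_lengths).
    """
    # Sort by start position, then by length, shortest to longest.
    regions = sorted(regions, key=lambda r: (r[0], r[1] - r[0]))
    starts = [start for start, _, _ in regions]
    offsets = [later - earlier for earlier, later in zip(starts, starts[1:])]
    lengths = [stop - start for start, stop, _ in regions]
    short_offsets = fits_in_short(offsets)
    short_lengths = fits_in_short(lengths)

    payload = bytearray()
    payload += struct.pack(">H", 0)  # unused header text "", length prefix only
    previous_start = None
    for (start, _, score), length in zip(regions, lengths):
        if previous_start is None:
            payload += struct.pack(">i", start)  # real genomic position
        else:
            payload += pack_number(start - previous_start, short_offsets)
        payload += pack_number(length, short_lengths)
        payload += struct.pack(">f", score)
        previous_start = start
    return bytes(payload), short_offsets, short_lengths
```

With shorts for both offsets and lengths, every observation after the first costs 8 bytes before zip compression (2 for the offset, 2 for the length, 4 for the score) instead of 12 with ints.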