HDF5 file
HDF5 is a binary file format that allows for high-speed random access to very large datasets, and includes support for user-defined composite data structures and compression. It is version 5 of HDF. The first version was created in 1987 and it has been used by hundreds of organisations worldwide including NASA, Deutsche Bank, Baylor College of Medicine and ANSTO.
HDF is maintained by The HDF Group and the current
development source code is maintained in a public git repository on their BitBucket
server.
HDF5 version 1.12 is the latest released version as at 2020-05-03.
Source packages for current and previous releases are located here.
The data stored in the qpileup HDF5 file conceptually fits into 3 categories:

* position - which relates to the reference genome,
* strand summary - which holds the per-base metrics derived from the reads in the BAMs added to the HDF5 file,
* metadata - which is a log of the bootstrap/add/remove operations that have been applied to the HDF5 file.
position
Position is stored in the HDF5 file as 1D Scalar datasets - Integer for position, char (as byte) for base.
Data Elements are chunked (size=10000), with compression level of 1.
| Data Element | Type | Description |
|---|---|---|
| Position | Integer | Offset of this base within the sequence. Should be 1-based so the first base is numbered 1. |
| Reference | Char | Reference base at this position |
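Because positions are 1-based and the datasets are chunked in blocks of 10000 elements, reading a single position only needs one chunk. The sketch below shows the arithmetic. It assumes (this page does not state it) that the element for position p is stored at 0-based array index p - 1; the function name `locate` is hypothetical.

```python
from typing import Tuple

CHUNK_SIZE = 10000  # chunk size stated above


def locate(position: int, chunk_size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Return (chunk index, offset within chunk) for a 1-based position.

    Assumption: the element for position p sits at 0-based index p - 1.
    """
    if position < 1:
        raise ValueError("positions are 1-based")
    index = position - 1
    return divmod(index, chunk_size)


print(locate(1))      # (0, 0)
print(locate(10000))  # (0, 9999)
print(locate(10001))  # (1, 0)
```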
strand summary
Each of the following data elements is compiled independently for each
strand so these elements will exist in the HDF5 file in _for (forward)
and _rev (reverse) versions, for example: Aqual_for and Aqual_rev.
Strand Data Elements are created individually as 1D Scalar Datasets. This structure is used for speed: compound datasets or 2D datasets result in much slower run times due to inefficient data structure compatibility between Java and C.
Data Elements are chunked (size=10000), with compression level of 1.
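To illustrate the naming scheme, the sketch below expands a few element names into their per-strand dataset names. Only the `_for`/`_rev` suffixes and the element names come from this page; the helper name `strand_dataset_names` is hypothetical.

```python
STRAND_SUFFIXES = ("_for", "_rev")  # forward and reverse strands


def strand_dataset_names(elements):
    """Expand each element name into its forward and reverse dataset names."""
    return [name + suffix for name in elements for suffix in STRAND_SUFFIXES]


names = strand_dataset_names(["A", "Aqual", "CigarD_start"])
print(names)
# ['A_for', 'A_rev', 'Aqual_for', 'Aqual_rev', 'CigarD_start_for', 'CigarD_start_rev']
```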
| Data Element | Type | Description |
|---|---|---|
| A | Integer | Count of all the A bases observed |
| C | Integer | Count of all the C bases observed |
| G | Integer | Count of all the G bases observed |
| T | Integer | Count of all the T bases observed |
| N | Integer | Count of all the N bases observed |
| Aqual | Long | Sum of the qualities of all the A bases at this position |
| Cqual | Long | Sum of the qualities of all the C bases at this position |
| Gqual | Long | Sum of the qualities of all the G bases at this position |
| Tqual | Long | Sum of the qualities of all the T bases at this position |
| Nqual | Long | Sum of the qualities of all the N bases at this position |
| MapQual | Long | Sum of the mapping qualities of all reads that provide bases at this position |
| StartAll | Integer | Count of all reads where alignment starts at this base (obeys clipping) |
| StartNondup | Integer | As for StartAll except that we only count non-duplicate reads (obeys clipping) |
| StopAll | Integer | Count of all reads where alignment stops at this base (obeys clipping) |
| DupCount | Integer | Count of reads that were flagged as duplicate and have a base at this position |
| MateUnmapped | Integer | Count of reads at this position that have an unmapped mate |
| CigarI | Integer | Count of reads that have an "I" in the CIGAR string at this position. Counted only at the first position at which the insertion occurs, defined as where there is an insertion between this reference position and the next reference position. |
| CigarD | Integer | Count of reads that have a "D" in the CIGAR string at this position |
| CigarS | Integer | Count of reads that have an "S" in the CIGAR string at this position |
| CigarH | Integer | Count of reads that have an "H" in the CIGAR string at this position |
| CigarN | Integer | Count of reads that have an "N" in the CIGAR string at this position (only valid for RNA alignments) |
| CigarD_start | Integer | Count of reads that have a "D" in the CIGAR string that starts at this position |
| CigarS_start | Integer | Count of reads that have an "S" in the CIGAR string that starts at this position |
| CigarH_start | Integer | Count of reads that have an "H" in the CIGAR string that starts at this position |
| CigarN_start | Integer | Count of reads that have an "N" in the CIGAR string that starts at this position |
| LowReadCount | Integer | Count of the number of BAMs that have a low number of reads covering this position. By default the low read count threshold is set to 10, but the lowreadcount option in the INI file can be used to set this in bootstrap mode. If a BAM has a low read count at a position, it is not used when calculating the HighNonreference base count. |
| ReferenceNo | Integer | Count of the number of bases at this position which are the same as the reference base |
| NonreferenceNo | Integer | Count of the number of bases at this position which are not the same as the reference base |
| HighNonreference | Integer | Count of the number of BAMs that have a high number of non-reference bases at this position. By default this is defined as non-reference bases accounting for at least 20% of the total number of bases for this BAM at this position. The minimum number of bases can be defined in bootstrap mode using the lowreadcount inifile option. The non-reference base percentage minimum threshold can be defined using the percentnonref inifile option during bootstrap. |
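The interaction between LowReadCount and HighNonreference can be sketched for one BAM at one position as follows. This is an illustration under assumptions, not qpileup's actual code: it assumes the base total for a BAM is its reference count plus its non-reference count, and that "low" means strictly fewer bases than the threshold. The function and parameter names are hypothetical; the defaults of 10 and 20% come from the table above.

```python
def classify_bam_position(ref_count, nonref_count,
                          low_read_count=10, percent_nonref=20.0):
    """Return (is_low_read_count, is_high_nonreference) for one BAM at one position."""
    total = ref_count + nonref_count
    # Assumption: "low" means strictly fewer bases than the threshold.
    # The zero check also avoids dividing by zero below.
    if total == 0 or total < low_read_count:
        # A low-coverage BAM is not used for the HighNonreference count.
        return True, False
    # Non-reference bases must account for at least percent_nonref of all bases.
    is_high = 100.0 * nonref_count / total >= percent_nonref
    return False, is_high


print(classify_bam_position(3, 2))    # (True, False): only 5 bases
print(classify_bam_position(40, 10))  # (False, True): 10 of 50 is 20%
print(classify_bam_position(45, 5))   # (False, False): 5 of 50 is 10%
```

Summing the two flags over all BAMs would then give the per-position LowReadCount and HighNonreference values, under the same assumptions.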
metadata
Stored as chunked (size=1) 1D Scalar datasets, with compression level of 1. Strings are stored as bytes. There are two types of metadata:
Record metadata
A comma-separated string with the following elements:
| Data Element | Description |
|---|---|
| Mode | Mode carried out (bootstrap/add/remove) |
| Date | Date that the Mode was performed |
| Run time | Run time for the analysis |
| Bam | Path of BAM file added/removed |
| Record Count | Number of records in the BAM file |
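A record could be split back into its fields along these lines. The sample string is invented for illustration: this page does not specify the date or run time formats, nor whether fields carry labels, so the sketch assumes five bare values in table order and no commas inside the BAM path. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    mode: str
    date: str
    run_time: str
    bam: str
    record_count: int


def parse_record(line: str) -> MetadataRecord:
    """Parse one comma-separated record (assumed: five values, table order)."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    mode, date, run_time, bam, count = fields
    if mode not in ("bootstrap", "add", "remove"):
        raise ValueError(f"unknown mode: {mode}")
    return MetadataRecord(mode, date, run_time, bam, int(count))


# Hypothetical record; the real layout may differ.
record = parse_record("add,2020-05-03,00:12:31,/data/sample1.bam,1500000")
print(record.mode, record.record_count)
```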
The following attributes are also associated with the metadata and are added during bootstrap mode and potentially modified in other modes:
* bams_added: a count of the BAMs that have been added in add mode. Is modified during add mode.
* low_read_count: The low_read_count default or as set by the option from the bootstrap mode INI file. Cannot be modified after bootstrapping and will return an error if this option is different than the one added during bootstrap.
* non_reference_threshold: The percentnonref iniFile option. Cannot be modified after bootstrapping and will return an error if this option is different than the one added during bootstrap.
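The rule that bootstrap-time options cannot change later amounts to a consistency check like the one sketched here. The attribute names come from the text above; the function name, the dict representation of the attributes, and the error type are assumptions made for illustration.

```python
LOCKED_ATTRIBUTES = ("low_read_count", "non_reference_threshold")


def check_locked_options(stored: dict, requested: dict) -> None:
    """Raise if a requested option differs from the value set at bootstrap."""
    for name in LOCKED_ATTRIBUTES:
        if name in requested and requested[name] != stored[name]:
            raise ValueError(
                f"{name} was set to {stored[name]} at bootstrap "
                f"and cannot be changed to {requested[name]}"
            )


stored = {"bams_added": 3, "low_read_count": 10, "non_reference_threshold": 20}
check_locked_options(stored, {"low_read_count": 10})  # matches, so no error
```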
Reference metadata
The reference metadata contains length and name information for the reference genome. The information is added during bootstrap and cannot be modified after this point. It is a comma separated string with the following elements:
| Data Element | Description |
|---|---|
| Sequence | Name of the reference sequence (ie chromosome or contig) |
| Length | Number of base pairs in the sequence |
Options