HDF5 file

HDF5 is a binary file format that allows for high-speed random access to very large datasets, and includes support for user-defined composite data structures and compression. It is version 5 of HDF. The first version was created in 1987 and it has been used by hundreds of organisations worldwide including NASA, Deutsche Bank, Baylor College of Medicine and ANSTO.

HDF is maintained by The HDF Group, and the current development source code is kept in a public Git repository on their Bitbucket server. As of 2020-05-03, the latest released version is HDF5 1.12 (hdf5_1_12).

Source packages for current and previous releases are located here.

The data stored in the qpileup HDF5 file conceptually fits into 3 categories:

  • position - which relates to the reference genome,
  • strand summary - which holds the per-base metrics derived from the reads in the BAMs added to the HDF5 file,
  • metadata - which is a log of the bootstrap/add/remove operations that have been applied to the HDF5 file.

position

Position is stored in the HDF5 file as 1D scalar datasets: Integer for position and char (as byte) for the reference base.

Data Elements are chunked (size=10000), with compression level of 1.

| Data Element | Type | Description |
|---|---|---|
| Position | Integer | Offset of this base within the sequence. Should be 1-based so the first base is numbered 1. |
| Reference | Char | Reference base at this position |
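To make the chunking concrete, the sketch below maps a 1-based position to a chunk number and an offset inside that chunk, using the chunk size of 10000 given above. It assumes that position p is stored at 0-based array index p - 1; the function name `locate` is hypothetical and is not part of qpileup.

```python
CHUNK_SIZE = 10_000  # chunk size stated for the position datasets


def locate(position, chunk_size=CHUNK_SIZE):
    """Return (chunk_index, offset_in_chunk) for a 1-based position.

    Assumption: position p is stored at 0-based array index p - 1.
    """
    if position < 1:
        raise ValueError("positions are 1-based, so the minimum is 1")
    return divmod(position - 1, chunk_size)


print(locate(1))       # (0, 0): first base of the sequence
print(locate(10_000))  # (0, 9999): last element of the first chunk
print(locate(10_001))  # (1, 0): first element of the second chunk
```

Under this assumption, reading a window of neighbouring positions touches at most a handful of chunks, which is what makes random access into a large sequence cheap.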

strand summary

Each of the following data elements is compiled independently for each strand so these elements will exist in the HDF5 file in _for (forward) and _rev (reverse) versions, for example: Aqual_for and Aqual_rev.
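As a small illustration of the naming scheme, the sketch below expands element names into their per-strand dataset names. It assumes the `_for`/`_rev` suffix is applied uniformly to every element name; the helper `strand_dataset_names` is hypothetical.

```python
STRAND_SUFFIXES = ("for", "rev")  # forward and reverse strand


def strand_dataset_names(elements):
    """Expand each element name into its forward and reverse dataset names."""
    return [
        f"{element}_{suffix}"
        for element in elements
        for suffix in STRAND_SUFFIXES
    ]


print(strand_dataset_names(["A", "Aqual", "CigarI"]))
# ['A_for', 'A_rev', 'Aqual_for', 'Aqual_rev', 'CigarI_for', 'CigarI_rev']
```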

Strand Data Elements are created individually as 1D scalar datasets. This structure was chosen for speed: compound datasets or 2D datasets result in much slower run times because of inefficiencies in data structure compatibility between Java and C.

Data Elements are chunked (size=10000), with compression level of 1.
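Because each element is its own 1D dataset, values for one position are combined by reading the same index from several parallel arrays. The sketch below derives a mean base quality from a count array and a quality-sum array (for example A_for and Aqual_for). The arrays are plain Python lists holding made-up values, and `mean_base_quality` is a hypothetical helper, not a qpileup function.

```python
def mean_base_quality(counts, quality_sums):
    """Per-position mean quality from two parallel 1D arrays.

    Positions with a zero count have no defined mean and yield None.
    """
    if len(counts) != len(quality_sums):
        raise ValueError("parallel datasets must have the same length")
    return [
        qual / count if count else None
        for count, qual in zip(counts, quality_sums)
    ]


a_for = [0, 2, 4]          # hypothetical A_for values
aqual_for = [0, 60, 100]   # hypothetical Aqual_for values
print(mean_base_quality(a_for, aqual_for))  # [None, 30.0, 25.0]
```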

| Data Element | Type | Description |
|---|---|---|
| A | Integer | Count of all the A bases observed |
| C | Integer | Count of all the C bases observed |
| G | Integer | Count of all the G bases observed |
| T | Integer | Count of all the T bases observed |
| N | Integer | Count of all the N bases observed |
| Aqual | Long | Sum of the qualities of all the A bases at this position |
| Cqual | Long | Sum of the qualities of all the C bases at this position |
| Gqual | Long | Sum of the qualities of all the G bases at this position |
| Tqual | Long | Sum of the qualities of all the T bases at this position |
| Nqual | Long | Sum of the qualities of all the N bases at this position |
| MapQual | Long | Sum of the mapping qualities of all reads that provide bases at this position |
| StartAll | Integer | Count of all reads where alignment starts at this base (obeys clipping) |
| StartNondup | Integer | As for StartAll except that only non-duplicate reads are counted (obeys clipping) |
| StopAll | Integer | Count of all reads where alignment stops at this base (obeys clipping) |
| DupCount | Integer | Count of reads that were flagged as duplicate and have a base at this position |
| MateUnmapped | Integer | Count of reads at this position that have an unmapped mate |
| CigarI | Integer | Count of reads that have an "I" in the CIGAR string at this position. Counted only at the first position at which the insertion occurs, defined as where there is an insertion between this reference position and the next reference position. |
| CigarD | Integer | Count of reads that have a "D" in the CIGAR string at this position |
| CigarS | Integer | Count of reads that have an "S" in the CIGAR string at this position |
| CigarH | Integer | Count of reads that have an "H" in the CIGAR string at this position |
| CigarN | Integer | Count of reads that have an "N" in the CIGAR string at this position (only valid for RNA alignments) |
| CigarD_start | Integer | Count of reads that have a "D" in the CIGAR string that starts at this position |
| CigarS_start | Integer | Count of reads that have an "S" in the CIGAR string that starts at this position |
| CigarH_start | Integer | Count of reads that have an "H" in the CIGAR string that starts at this position |
| CigarN_start | Integer | Count of reads that have an "N" in the CIGAR string that starts at this position |
| LowReadCount | Integer | Count of the number of BAMs that have a low number of reads covering this position. By default LowReadCount is set to 10, but the lowreadcount option in the INI file can be used to set this in bootstrap mode. If a BAM has a low read count at a position, it is not used when calculating the HighNonreference count. |
| ReferenceNo | Integer | Count of the number of bases at this position which are the same as the reference base |
| NonreferenceNo | Integer | Count of the number of bases at this position which are not the same as the reference base |
| HighNonreference | Integer | Count of the number of BAMs that have a high number of non-reference bases at this position. By default this is defined as non-reference bases accounting for at least 20% of the total number of bases for this BAM at this position. The minimum number of bases can be defined in bootstrap mode using the lowreadcount INI file option. The non-reference base percentage minimum threshold can be defined using the percentnonref INI file option during bootstrap. |
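The interaction between LowReadCount and HighNonreference can be sketched as follows. This is an illustration of the rules described above, not the qpileup implementation. It assumes that "low" means strictly fewer bases than the lowreadcount value, and that the total for a BAM is its reference count plus its non-reference count at the position. The function `summarise_position` and the input tuples are hypothetical.

```python
LOW_READ_COUNT = 10    # default lowreadcount stated in the text
PERCENT_NONREF = 20.0  # default percentnonref stated in the text


def summarise_position(per_bam_counts,
                       low_read_count=LOW_READ_COUNT,
                       percent_nonref=PERCENT_NONREF):
    """Return (LowReadCount, HighNonreference) for one position.

    per_bam_counts holds one (reference_bases, nonreference_bases)
    tuple per BAM.
    """
    low = high = 0
    for ref, nonref in per_bam_counts:
        total = ref + nonref
        # Assumption: "low" means strictly below the threshold.
        if total == 0 or total < low_read_count:
            low += 1
            continue  # low-coverage BAMs are not considered for HighNonreference
        if 100.0 * nonref / total >= percent_nonref:  # "at least" the threshold
            high += 1
    return low, high


# Three hypothetical BAMs: low coverage, 20% non-reference, 5% non-reference.
print(summarise_position([(3, 2), (40, 10), (95, 5)]))  # (1, 1)
```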

metadata

Metadata is stored as chunked (size=1) 1D scalar datasets, with a compression level of 1. Strings are stored as bytes. There are two types of metadata:

Record metadata

A comma-separated string with the following elements:

| Data Element | Description |
|---|---|
| Mode | Mode carried out (bootstrap/add/remove) |
| Date | Date that the Mode was performed |
| Run time | Run time for the analysis |
| Bam | Path of BAM file added/removed |
| Record Count | Number of records in the BAM file |
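A record metadata string could be split into named fields along the following lines. The sketch assumes the fields appear in the order listed above, that no field contains a comma, and that Record Count is always an integer. The example line, including its date and run time formats, is made up for illustration; `RecordMetadata` and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RecordMetadata:
    mode: str          # bootstrap, add or remove
    date: str          # kept as text; the date format is not specified here
    run_time: str      # kept as text; the run time format is not specified here
    bam: str           # path of the BAM file added or removed
    record_count: int  # number of records in the BAM file


def parse_record(line):
    """Split one comma-separated record string into its five fields."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    mode, date, run_time, bam, record_count = fields
    return RecordMetadata(mode, date, run_time, bam, int(record_count))


# Hypothetical record; real files may format the values differently.
record = parse_record("add,2020-05-03,00:05:12,/data/sample1.bam,1500")
print(record.mode, record.bam, record.record_count)
```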

The following attributes are also associated with the metadata and are added during bootstrap mode and potentially modified in other modes:

  • bams_added: a count of the BAMs that have been added in add mode. It is updated each time add mode is run.
  • low_read_count: the low_read_count default, or the value set by the option in the bootstrap mode INI file. It cannot be modified after bootstrapping; an error is returned if the option differs from the value set during bootstrap.
  • non_reference_threshold: the value of the percentnonref INI file option. It cannot be modified after bootstrapping; an error is returned if the option differs from the value set during bootstrap.

Reference metadata

The reference metadata contains length and name information for the reference genome. The information is added during bootstrap and cannot be modified after this point. It is a comma separated string with the following elements:

| Data Element | Description |
|---|---|
| Sequence | Name of the reference sequence (i.e. chromosome or contig) |
| Length | Number of base pairs in the sequence |
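A reference metadata string can be parsed in the same spirit. The sketch assumes the order is sequence name followed by length, and splits on the last comma so that a name containing commas would still parse. It also shows how many chunks of size 10000 (the chunk size used for the position and strand datasets) a sequence of that length would occupy. The sequence name and length in the example are made up, and both function names are hypothetical.

```python
import math

CHUNK_SIZE = 10_000  # chunk size of the position and strand datasets


def parse_reference(line):
    """Return (sequence_name, length) from a 'name,length' string."""
    name, length = line.rsplit(",", 1)  # split on the last comma only
    return name.strip(), int(length)


def chunk_count(length, chunk_size=CHUNK_SIZE):
    """Number of chunks needed to hold `length` per-base elements."""
    return math.ceil(length / chunk_size)


name, length = parse_reference("chrExample,25000")  # hypothetical sequence
print(name, length, chunk_count(length))  # chrExample 25000 3
```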