qbamfilter query
The qbamfilter library is used in almost all AdamaJava tools as a way
of filtering out BAM file records that are not appropriate for the analysis
being conducted.
The heart of qbamfilter is its query language which determines which
records pass and are included in the analysis, and which records fail and
are discarded.
The general form of a qbamfilter query string is:
operator( condition [, condition|query]* )
The query string consists of one or more conditions and zero or more
queries joined by operators. Currently only two operators are
available: and() and or(). Conditions take the form:
key comparator value
There is no operator required if the query contains a single condition.
Conditions are evaluated left to right so standard
short-circuit thinking applies, i.e. in an and() operator, if the first
condition evaluates to FALSE, the rest of the conditions are not evaluated.
This means it is cheaper (and faster) to put the conditions that will
reject the most records first, i.e. on the left.
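The short-circuit behaviour described above can be illustrated with a minimal Python sketch. This is not the qbamfilter implementation: the tuple-based query tree, the `evaluate` function and the dictionary record are hypothetical stand-ins chosen only to show why the order of conditions matters.

```python
def evaluate(node, record, calls):
    """Evaluate a hypothetical query tree, logging each condition that actually runs."""
    kind = node[0]
    if kind == "cond":
        _, name, predicate = node
        calls.append(name)
        return predicate(record)
    children = node[1]
    if kind == "and":
        # all() stops consuming the generator at the first False child
        return all(evaluate(child, record, calls) for child in children)
    # any() stops consuming the generator at the first True child
    return any(evaluate(child, record, calls) for child in children)


# Hypothetical tree for: and( RNAME =~ chr*, Cigar_M > 35 )
query = ("and", [
    ("cond", "RNAME =~ chr*", lambda r: r["RNAME"].startswith("chr")),
    ("cond", "Cigar_M > 35", lambda r: r["cigar_m"] > 35),
])

calls = []
result = evaluate(query, {"RNAME": "GL000191.1", "cigar_m": 50}, calls)
# result is False and calls == ["RNAME =~ chr*"]: the second condition never ran
```

Because the first condition already fails for the non-chromosome record, the CIGAR condition is skipped entirely, which is the saving that left-to-right ordering is meant to exploit.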
Example Queries
RNAME =~ chr*
This first example has a single condition RNAME =~ chr*. Remembering
that records are only kept if the qbamfilter string resolves to TRUE,
this query has the effect of rejecting any BAM record where the RNAME
field does not start with the string "chr". This is useful in cases
where your reference contains chromosomes plus additional non-chromosome
sequences, e.g. GL000191.1, and where you wish to ignore the
non-chromosome matches for a particular analysis.
and( RNAME =~ chr*, Cigar_M > 35 )
This query is based on the previous query but with the addition of the
and() operator and a second condition that requires that more than 35
of the bases in the CIGAR string are "M" indicating a match or mismatch.
The effect of this second condition is that records where most of the
bases are clipped or inserted are discarded.
and( RNAME =~ chr*, Cigar_M > 35, or( MAPQ > 20, option_ZM == 1 ) )
This example shows the addition of a query to the and() operator so
now there are 2 conditions and a sub-query. The or() clause shown has
2 conditions and will return TRUE if either of the conditions is
TRUE. The first condition requires that a record's MAPQ score is above
20 and the second condition requires that a user-supplied ZM field is
present and has a value of 1.
and( RNAME =~ chr*, Cigar_M > 35, or(MAPQ > 50, option_ZM == 1), Flag_DuplicateRead == false )
This query string adds another condition - that the FLAG field
indicates that the record is not a duplicate. Remember that all
conditions need to evaluate to true for the and() operator to evaluate
to TRUE and pass the record, so we have to be careful with each
condition to make sure it is doing what we want. In this case we don't want
any duplicate reads to pass and make it into our analysis so we need
Flag_DuplicateRead == false.
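To make the structure of these example queries concrete, the following Python sketch parses a query string of the general form operator( condition [, condition|query]* ) into a nested tree. It is an illustration only and makes several assumptions: the function names, the tuple representation and the rule that keys and values contain no whitespace are all hypothetical, and the real qbamfilter parser may behave differently.

```python
import re


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return parts


OPERATOR = re.compile(r"^(and|or)\s*\((.*)\)$", re.IGNORECASE | re.DOTALL)
# Assumption: keys and values are single whitespace-free tokens.
CONDITION = re.compile(r"^(\S+)\s*(==|!=|>=|<=|=~|!~|>|<)\s*(\S+)$")


def parse(query):
    """Return ('and'|'or', [children]) or ('cond', key, comparator, value)."""
    query = query.strip()
    m = OPERATOR.match(query)
    if m:
        children = [parse(part) for part in split_top_level(m.group(2))]
        return (m.group(1).lower(), children)
    m = CONDITION.match(query)
    if not m:
        raise ValueError(f"not a valid condition: {query!r}")
    return ("cond", m.group(1), m.group(2), m.group(3))


tree = parse("and( RNAME =~ chr*, Cigar_M > 35, or( MAPQ > 20, option_ZM == 1 ) )")
```

Here `tree` is an "and" node holding two conditions and one nested "or" node, mirroring the description of the third example query.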
BAM Fields
The tables below list available condition types for SAM/BAM fields.
FLAG
FLAG is a bitmap of properties of the read including some properties of the read's pair for paired-end or mate-pair sequencing.
| Key | Comparator | Value |
|---|---|---|
| flag_ReadPaired, flag_ProperPair, flag_ReadUnmapped, flag_Mateunmapped, flag_ReadNegativeStrand, flag_MateNegativeStrand, flag_FirstOfpair, flag_SecondOfpair, flag_NotprimaryAlignment, flag_ReadFailsVendorQuality, flag_DuplicateRead, flag_SupplementaryRead | ==, != | 1, 0, true, false |
Note that values are case-insensitive, so true, True and TRUE are
all equivalent. Also note that 1 and true are
equivalent and that 0 and false are equivalent.
Example: match duplicate reads
flag_DuplicateRead == true
flag_DuplicateRead != 0
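As a rough illustration of how a FLAG condition relates to the underlying bitmap, the Python sketch below maps each key to its bit value from the SAM specification and normalises the value as described above. The function names are hypothetical, and the lower-casing of keys is an assumption made here for lookup convenience; this is not the qbamfilter code.

```python
# Bit values as defined for the FLAG field in the SAM specification.
FLAG_BITS = {
    "flag_readpaired": 0x1,
    "flag_properpair": 0x2,
    "flag_readunmapped": 0x4,
    "flag_mateunmapped": 0x8,
    "flag_readnegativestrand": 0x10,
    "flag_matenegativestrand": 0x20,
    "flag_firstofpair": 0x40,
    "flag_secondofpair": 0x80,
    "flag_notprimaryalignment": 0x100,
    "flag_readfailsvendorquality": 0x200,
    "flag_duplicateread": 0x400,
    "flag_supplementaryread": 0x800,
}


def parse_bool(value):
    """Treat 1/true and 0/false as equivalent, ignoring case."""
    text = value.strip().lower()
    if text in ("1", "true"):
        return True
    if text in ("0", "false"):
        return False
    raise ValueError(f"not a boolean value: {value!r}")


def flag_condition(flag, key, comparator, value):
    """Evaluate one FLAG condition against an integer FLAG field."""
    is_set = bool(flag & FLAG_BITS[key.lower()])  # assumption: keys matched case-insensitively
    wanted = parse_bool(value)
    if comparator == "==":
        return is_set == wanted
    if comparator == "!=":
        return is_set != wanted
    raise ValueError(f"unsupported comparator for FLAG: {comparator!r}")


# FLAG 1107 = 0x400 + 0x40 + 0x10 + 0x2 + 0x1, so the duplicate bit is set.
is_duplicate = flag_condition(1107, "flag_DuplicateRead", "==", "true")
```

Under these assumptions, both forms in the example above (== true and != 0) select the same records.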
CIGAR
| Key | Comparator | Value |
|---|---|---|
| Cigar_M, Cigar_I, Cigar_D, Cigar_N, Cigar_S, Cigar_H, Cigar_P | ==, !=, >=, <=, >, < | integer |
For an explanation of the various CIGAR operations (M, I, D, etc.), see the
SAM/BAM specification.
Example 1: match records with a count of matched/mismatched bases greater than 15
Cigar_M >= 16
Cigar_M > 15
Example 2: match records that have at least 1 base inserted or deleted
or( Cigar_I > 0, Cigar_D > 0 )
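The sketch below shows one way the per-operation counts behind the Cigar_X keys could be derived from a CIGAR string. It assumes that each key refers to the total number of bases across all operations of that type (consistent with the description of Cigar_M earlier in this document); the function name is hypothetical and this is not the qbamfilter implementation.

```python
import re
from collections import Counter

# One CIGAR element is a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_counts(cigar):
    """Sum the lengths of each operation type in a CIGAR string."""
    counts = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] += int(length)
    return counts


counts = cigar_counts("10S36M2I4M3D8M")
# counts["M"] == 48, counts["I"] == 2, counts["D"] == 3, counts["S"] == 10

passes_example_1 = counts["M"] > 15                     # Cigar_M > 15
passes_example_2 = counts["I"] > 0 or counts["D"] > 0   # or( Cigar_I > 0, Cigar_D > 0 )
```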
MAPQ
| Key | Comparator | Value |
|---|---|---|
| MAPQ | ==, !=, >=, <=, >, < | integer |
Example: match records with a mapping quality greater than 20
MAPQ > 20
SEQ
| Key | Comparator | Value |
|---|---|---|
| seq_numberN | ==, !=, >=, <=, >, < | integer |
The only property that can be queried here is the count of "N" bases.
Example: match records containing fewer than 5 N bases
seq_numberN < 5
QUAL
| Key | Comparator | Value |
|---|---|---|
| qual_average | ==, !=, >=, <=, >, < | integer |
The only property that can be queried here is the average base quality.
Example: match records where average base quality is less than 20
qual_average < 20
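For readers unfamiliar with how an average base quality can be obtained from the QUAL field, here is a minimal Python sketch. It assumes the SAM text encoding (Phred score plus 33 per character) and a plain arithmetic mean; whether qbamfilter rounds or truncates the average before comparing it to an integer is not stated in this document, and the function name is hypothetical.

```python
def average_base_quality(qual):
    """Mean Phred score of a SAM-format QUAL string (Phred+33 encoding)."""
    if not qual or qual == "*":
        # "*" means quality values were not stored for this record
        raise ValueError("no base qualities available")
    scores = [ord(ch) - 33 for ch in qual]
    return sum(scores) / len(scores)


low_quality = average_base_quality("5555") < 20   # each "5" decodes to Phred 20
```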
TLEN
| Key | Comparator | Value |
|---|---|---|
| TLEN | ==, !=, >=, <=, >, < | integer |
Example: match records with template size larger than 1000
TLEN > 1000
POS
| Key | Comparator | Value |
|---|---|---|
| pos | ==, !=, >=, <=, >, < | integer |
Example: match records with start position between 1000 and 2000
inclusive
and( pos >= 1000, pos <= 2000 )
RNAME, RNEXT
| Key | Comparator | Value |
|---|---|---|
| RNAME, RNEXT | ==, != | string (exact match) |
| RNAME, RNEXT | =~, !~ | string (wildcard '*' at start or end of string) |
Example 1: match records that mapped to the X chromosome
RNAME == chrX
Example 2: match records where the paired read maps to a different sequence
RNAME != RNEXT
Example 3: match records where the paired read maps to a different sequence and one of the reads in the pair mapped to chromosome X
and( RNAME != RNEXT, or( RNAME == chrX, RNEXT == chrX ) )
Optional Fields
BAM records can have an optional 12th field where users can set their
own fields of the form TAG:TYPE:VALUE. The option_XX condition
allows queries to run against optional fields where XX is replaced by
the name of the tag - see the examples below. Three types of comparison
are currently supported: integer logic, exact string matching and string
pattern matching.
| Key | Comparator | Value |
|---|---|---|
| option_<tag> | ==, !=, >=, <=, >, < | integer |
| option_<tag> | ==, != | string (exact match) |
| option_<tag> | =~, !~ | string (wildcard '*' at start or end of string) |
Example 1: match records where the tag "ZM" has a value less than 2
option_ZM < 2
Example 2: match records with tag "RG" set to 'Tumor'
option_RG == Tumor
Example 3: match records with tag "ZP" set to 'Z**'
option_ZP == Z**
Example 4: match records where the tag "RG" does not contain the substring "known"
option_RG !~ known
Example 5: match records where the tag "ZP" does not start with 'null'
option_ZP !~ null*
Example 6: match records where the tag "ZP" does not end with 'null'
option_ZP !~ *null
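The pattern-matching behaviour implied by Examples 4 to 6 can be sketched in Python as follows. The function name is hypothetical, and the handling of a pattern with '*' at both ends is an assumption, since that case is not covered by the examples above; this is an illustration rather than the qbamfilter implementation.

```python
def wildcard_match(pattern, text):
    """Return True if text matches a pattern with an optional leading or trailing '*'."""
    leading = pattern.startswith("*")
    trailing = pattern.endswith("*")
    core = pattern.strip("*")
    if trailing and not leading:
        return text.startswith(core)   # e.g. null* : starts with "null"
    if leading and not trailing:
        return text.endswith(core)     # e.g. *null : ends with "null"
    # No wildcard: substring match, as in Example 4.
    # Wildcards at both ends: assumed here to mean the same thing.
    return core in text


# The !~ comparator is the negation of =~
keep = not wildcard_match("null*", "nullable")   # option_ZP !~ null*
```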
Special conditions
There is a special query condition available for use with the Optional Field MD. MD is almost always present in BAM files and in conjunction with the CIGAR string, it can exactly specify the differences between the read sequence and the reference sequence.
An extra condition using the key MD_mismatch can be used to operate
against a count of the number of mismatched bases in a read as
determined from the MD Optional Field.
| Key | Comparator | Value |
|---|---|---|
| MD_mismatch | ==!=>=<=>< |
integer |
Example: match records where there are fewer than 4 bases mismatched
against the reference
MD_mismatch < 4
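To show what a mismatch count derived from the MD field might look like, the Python sketch below counts the substituted reference bases in an MD string. It assumes that deleted bases (the runs introduced by '^') are not counted as mismatches; this document does not say how qbamfilter treats deletions, so that choice and the function name are hypothetical.

```python
import re

# In an MD string, "^" followed by letters marks bases deleted from the reference.
DELETION = re.compile(r"\^[A-Za-z]+")


def md_mismatch_count(md):
    """Count mismatched bases in an MD tag value, ignoring deletions."""
    without_deletions = DELETION.sub("", md)
    return sum(1 for ch in without_deletions if ch.isalpha())


mismatches = md_mismatch_count("10A5^AC6T0G3")   # A, T and G are mismatches; ^AC is a deletion
passes = mismatches < 4                          # MD_mismatch < 4
```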