qprofiler
qprofiler provides quality control reporting for next-generation sequencing(NGS). qprofiler takes FASTQ, BAM/SAM, FASTA, QUAL,GFF3, MA and VCF files as input and outputs an XML file containing summary statistics tailored to the input file type. If no output file is specified then the default output file name will be used: qprofiler.xml.
While the XML file is useful for extracting values for further analysis, a visual representation of the data is more useful in most cases so another tool qvisualise was created to parse the qprofiler XML files and produce HTML output with embedded graphs via the Charts javascript library developed by Google. qvisualise exists as a standalone program but it is also integrated into qprofiler so a HTML file will always be output by qprofiler unless the --nohtml tag is used. The HTML file name is based on the XML output filename with the extension .html appended.
Installation
qprofiler requires java 21, a machine with multi cores and 5GB of RAM will be ideally.
To do a build of qprofiler, first clone the adamajava repository and move into the adamajava folder. ~~~~{.text} git clone https://github.com/AdamaJava/adamajava cd adamajava git tag ~~~~
pick up a release, eg. "internal-7.a8ab31c.12803". ~~~~{.text} git checkout tags/internal-7.a8ab31c.12803 ~~~~
Run gradle to build qprofiler and its dependent jar files. ~~~~{.text} ./gradlew :qprofiler:build ~~~~ This creates the qprofiler jar file along with dependent jars in the qprofiler/build/flat folder.
Usage
If the release number is before 23.8546a7b, the main jar file name is "qprofiler-1.0.jar". ~~~~{.text} usage: java -jar qprofiler-1.0.jar [option...] -log logfile -loglevel INFO -output outputfile -input inputfile1 -input inputfile2 ... [-ntP 4 -ntC 16] [-exclude coverage,matrices,md,html] [-maxRecords 10000] [-tags XY,XZ,YZ] ~~~~
Release number is 23.8546a7b or after, the main jar file name is "qprofiler.jar". ~~~~{.text} usage: java -jar qprofiler.jar [option...] -log logfile -loglevel INFO -output outputfile -input inputfile1 -input inputfile2 ... [-ntP 4 -ntC 16] [-exclude coverage,matrices,md,html] [-maxRecords 10000] [-tags XY,XZ,YZ] ~~~~
The full option list is describe below but there are 3 options that you should probably specify every time you call qprofiler: --input, --output, and --log. If you have access to a multi-core machine (e.g. a compute node on a cluster) then you should also look at the thread-count parameters: --ntProducer and --ntConsumer if you are processing BAM files.
In general, we would recommend using as many consumer threads as you have cores available (so 16 consumers for a 16-core machine) and with approximately a 1:4 ratio between producer and consumer threads. The producer threads are relatively lightweight and will not occupy a full core each.
For example, to run on a 16 core computer, we would suggest something like:
java -jar qprofiler-1.0.jar \
-input ~/sample_virus.BWA-backtrack.bam \
-log ~/sample_virus.BWA-backtrack.bam.qp.log \
-output ~/sample_virus.BWA-backtrack.bam.qp.xml \
-ntP 4 -ntC 16
The recommendations on counts of consumer and producer threads are empirical so if you are going to do lots of qprofiler work, you should probably do some testing of your own to see what thread counts and ratios work best on your servers or cluster nodes. This is especially important for cluster work where core count is critical - if you request 8 cores, you need to make sure that your threading parameters are dialled to keep qprofiler inside the number of cores you requested. It's also worth noting that hyperthreaded cores can cause the counts to be off - clusters may count each hyperthreaded core as two cores, i.e. capable of running 2 threads, but they will not be as efficient as 2 separate cores so again you will need some empirical testing to see what thread counts and producer/consumer ratios work best for you.
It is also worth noting that it is not unusual to find BAM files that contain headers or reads considered to be invalid by the Picard library which will throw exceptions and cause qprofiler to exit. This is why the default option for --validation is SILENT but this is not an ideal situation. If you are primarily a consumer of BAMs then it's probably OK to mostly operate in SILENT mode but if anything odd happens with your output, you should rerun with STRICT or LENIENT to see if there
are problems with the BAM. If, on the other hand, you are a BAM producer, you should probably use STRICT and if any of your BAMs cause exceptions to be thrown, you should try to fix the underlying causes.
Options
Option Description
------ -----------
--help Shows this help message.
--include Include certain aggregations. Possible
values are "matrices", "coverage"
for BAM files.
--index File containing data to be profiled
(currently limited to BAM/SAM,
FASTQ, FASTA, QUAL, GFF3, MA)
--input File containing data to be profiled
--log File where log output will be directed
(must have write permissions)
--loglevel Logging level required, e.g. INFO,
DEBUG. (Optional) If no parameter is
specified, will default to INFO
--maxRecords <Integer> Only process the first {0} records in
the BAM file.
--nohtml If this option is set, qvisualise will
NOT be called after qprofiler has
been run and so no html output will
be generated
--ntConsumer <Integer> specify how many threads should be
used when processing the input file
(BAM files only)
--ntProducer <Integer> specify how many threads should be
--output File where the output of the qprofiler
should be written to (needs to be an
xml file)
--tags Perform aggregations on user defined
tags (Strings). Example values are
"ZC", "XY", etc.
--tagsChar Perform aggregations on user defined
tags (chars). Example values are
--tagsInt Perform aggregations on user defined
tags (ints). Example values are
--validation How strict to be when reading a SAM or
BAM. Possible values: {STRICT,
LENIENT, SILENT}
--version Print version info.
--include
This mode produces additional visualisations for the Life Technolgies
SOLiD platform. As this platform was abandoned in 2015, this option is
deprecated and the underlying code is not maintained.
It is not thread-safe and will not work correctly
in conjunction with the -ntC and -ntP options.
It will be removed in a subsequent release.
--maxRecords
Specify how many records should be parsed by the qprofiler. Note that
qprofiler will always start at the beginning of a BAM file, meaning that
you will always get the first maxRecords records back. This option is
designed for testing or for when you want a quick look at a BAM and can't
wait for the full file to be processed.
--format
Group VCF records according to user-specified format fields.
--fullBamHeader
By default, only @HD and @SQ lines from the BAM header are added to
the qprofiler2 XML report. The reason for this is that other header
lines may bleed information such as sample and library ids.
We'd like the XML report file to be something that can be freely shared
without risk of exposing sensitive information. By using this option,
the entire BAM header will be placed in the XML report so only use this
option if you have thought through the remifications.
--ntProducer
Optional and only relevant to BAM files. Specifies how many threads (integer) should be used to produce reads from the input file.
--ntConsumer
Optional and only relevant to BAM files. Specifies how many threads (integer) should be used to consume reads from the input file.
--tags
Perform aggregations on user defined tags for BAM files. Example values
are ZC, XY, etc. This option is considered legacy and may be
deprecated in a future release. As the contents of BAM files has
stabilised, custom reporting and visualisations have been created for
the most common and useful tags.
--loglevel
Level at which logging should be applied. Possible values in increasing order of detail are INFO, DEBUG, ALL. At DEBUG level and above, the logging is very granular so you should not use these levels unless you truly are debugging a qprofiler run. optional, defaults to INFO
--validation
How strict to be when reading a SAM or BAM file. Possible values are STRICT,
LENIENT, SILENT and the default is SILENT. This value is passed to the
Picard library as the parameter Validation Stringency
Output
This example output shows XML from running qrofiler against a BAM file.
This is a high level view and for readability, some of the ines have
been wrapped and most of the contents have been elided (...).
<qProfiler finish_time="2017-07-05 22:09:53" run_by_os="Linux" run_by_user="christiX"
start_time="2017-07-05 17:36:42" version="2.0 (1954)">
<BAMReport execution_finished="2017-07-05 22:09:33" execution_started="2017-07-05 17:36:42"
file="/path/where/we/keep/bams/0f443106-e17d-4200-87ec-bd66fe91195f.bam">
<HEADER>...</HEADER>
<SUMMARY>...</SUMMARY>
<SEQ>...</SEQ>
<QUAL>...</QUAL>
<TAG>...</TAG>
<ISIZE>...</ISIZE>
<RNEXT>...</RNEXT>
<CIGAR>...</CIGAR>
<MAPQ>...</MAPQ>
<RNAME_POS>...</RNAME_POS>
<FLAG>...</FLAG>
</BAMReport>
</qProfiler>
Log file
This example log file is from running qrofiler against a BAM file. The
majority of the log file has been elided (...) to save space.
17:36:42.356 [main] EXEC org.qcmg.qprofiler.QProfiler - Uuid c637af74-2c8f-4682-944a-ccd42dd57967
17:36:42.357 [main] EXEC org.qcmg.qprofiler.QProfiler - StartTime 2017-07-05 17:36:42
17:36:42.358 [main] EXEC org.qcmg.qprofiler.QProfiler - OsName Linux
17:36:42.358 [main] EXEC org.qcmg.qprofiler.QProfiler - OsArch amd64
17:36:42.359 [main] EXEC org.qcmg.qprofiler.QProfiler - OsVersion 3.10.0-327.3.1.el7.x86_64
17:36:42.360 [main] EXEC org.qcmg.qprofiler.QProfiler - RunBy christiX
17:36:42.360 [main] EXEC org.qcmg.qprofiler.QProfiler - ToolName qprofiler
17:36:42.361 [main] EXEC org.qcmg.qprofiler.QProfiler - ToolVersion 2.0 (1954)
17:36:42.362 [main] EXEC org.qcmg.qprofiler.QProfiler - CommandLine qprofiler --log /mnt/lustre/home/christiX/qprofiler/colo_829.analysis/qprofiler2.0/output/0f443106-e17d-4200-87ec-bd66fe91195f.bam.qp.xml.log --loglevel INFO --output /mnt/lustre/home/christiX/qprofiler/colo_829.analysis/qprofiler2.0/output/0f443106-e17d-4200-87ec-bd66fe91195f.bam.qp.xml --input /mnt/lustre/working/genomeinfo/sample/c/9/c9a6be94-bdb7-4c0d-a89d-4addbf76e486/aligned_read_group_set/0f443106-e17d-4200-87ec-bd66fe91195f.bam -ntP 4 -ntC 20
17:36:42.363 [main] EXEC org.qcmg.qprofiler.QProfiler - JavaHome /software/java/jdk1.8.0_77/jre
17:36:42.363 [main] EXEC org.qcmg.qprofiler.QProfiler - JavaVendor Oracle Corporation
17:36:42.364 [main] EXEC org.qcmg.qprofiler.QProfiler - JavaVersion 1.8.0_77
17:36:42.365 [main] EXEC org.qcmg.qprofiler.QProfiler - host hpcnode040.adqimr.ad.lan
17:36:42.367 [main] TOOL org.qcmg.qprofiler.QProfiler - Running in multi-threaded mode (BAM files only). No of available processors: 56, no of requested consumer threads: 20, producer threads: 4
17:36:42.415 [main] INFO org.qcmg.qprofiler.QProfiler - processing file /mnt/lustre/working/genomeinfo/sample/c/9/c9a6be94-bdb7-4c0d-a89d-4addbf76e486/aligned_read_group_set/0f443106-e17d-4200-87ec-bd66fe91195f.bam
17:36:42.418 [pool-1-thread-1] INFO org.qcmg.qprofiler.QProfiler - running BamSummarizerMT
17:36:42.770 [pool-1-thread-1] INFO org.qcmg.qprofiler.bam.BamSummarizerMT - will create 20 consumer threads
17:36:42.777 [pool-1-thread-1] INFO org.qcmg.qprofiler.bam.BamSummarizerMT - waiting for Producer thread to finish (max wait will be 20 hours)
17:36:42.948 [pool-3-thread-2] INFO org.qcmg.qprofiler.bam.BamSummarizerMT$Producer - retrieving records for sequence: chr1
17:36:42.969 [pool-3-thread-1] INFO org.qcmg.qprofiler.bam.BamSummarizerMT$Producer - retrieving records for sequence: chr2
17:36:42.974 [pool-3-thread-4] INFO org.qcmg.qprofiler.bam.BamSummarizerMT$Producer - retrieving records for sequence: chr3
...
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - StopTime 2017-07-05 22:09:54
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - TimeTaken 04:33:12
22:09:54.595 [main] EXEC org.qcmg.qprofiler.QProfiler - ExitStatus 0