pbh5tools

pbh5tools is a collection of tools for manipulating the content of, or extracting data from, two types of h5 files:

  • cmp.h5: files that contain alignment information.
  • bas.h5 and pls.h5: files that contain base-call information.

pbh5tools comprises two executables: cmph5tools.py and bash5tools.py. At the moment, cmph5tools.py provides a rich set of tools to manipulate and analyze the data in a cmp.h5 file, while bash5tools.py provides mechanisms to extract base-call information from bas.h5 files.

Installation

To install pbh5tools, run the following command from the pbh5tools root directory:

python setup.py install

Tool: bash5tools.py

bash5tools.py can extract read sequences and quality values for both raw and circular consensus sequencing (CCS) read types and use them to create FASTQ and FASTA files.

Usage

usage: bash5tools.py [-h] [--verbose] [--version] [--profile] [--debug]
                     [--outFilePrefix OUTFILEPREFIX]
                     [--readType {ccs,subreads,unrolled}] [--outType OUTTYPE]
                     [--minLength MINLENGTH] [--minReadScore MINREADSCORE]
                     [--minPasses MINPASSES]
                     input.bas.h5

Tool for extracting data from .bas.h5 files

positional arguments:
  input.bas.h5          input .bas.h5 filename

optional arguments:
  -h, --help            show this help message and exit
  --verbose, -v         Set the verbosity level (default: None)
  --version             show program's version number and exit
  --profile             Print runtime profile at exit (default: False)
  --debug               Run within a debugger session (default: False)
  --outFilePrefix OUTFILEPREFIX
                        output filename prefix [None]
  --readType {ccs,subreads,unrolled}
                        read type (ccs, subreads, or unrolled) []
  --outType OUTTYPE     output file type (fasta, fastq) [fasta]

Read filtering arguments:
  --minLength MINLENGTH
                        min read length [0]
  --minReadScore MINREADSCORE
                        min read score, valid only with
                        --readType={unrolled,subreads} [0]
  --minPasses MINPASSES
                        min number of CCS passes, valid only with
                        --readType=ccs [0]

Examples

Extracting all raw (unrolled) reads from input.bas.h5 without any filtering and exporting to FASTA (myreads.fasta):

python bash5tools.py input.bas.h5 --outFilePrefix myreads --outType fasta --readType unrolled

Extracting all CCS reads from input.bas.h5 with a minimum read length of 100 and exporting to FASTQ (myreads.fastq):

python bash5tools.py input.bas.h5 --outFilePrefix myreads --outType fastq --readType ccs --minLength 100
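
The option rules in the usage text can be checked before the tool is ever launched. The following is a minimal sketch, not part of pbh5tools: build_bash5tools_cmd is a hypothetical helper that assembles an argument list using only the flags listed in the usage output above, and rejects combinations that the help text describes as invalid. It assumes bash5tools.py is on the PATH and does not run it.

```python
# Hypothetical helper (not part of pbh5tools): assembles a bash5tools.py
# argument list from the options documented in the usage text.
READ_TYPES = ("ccs", "subreads", "unrolled")
OUT_TYPES = ("fasta", "fastq")


def build_bash5tools_cmd(input_file, prefix, read_type, out_type="fasta",
                         min_length=None, min_read_score=None,
                         min_passes=None):
    """Return the command as a list of strings; nothing is executed."""
    if read_type not in READ_TYPES:
        raise ValueError("readType must be one of %s" % (READ_TYPES,))
    if out_type not in OUT_TYPES:
        raise ValueError("outType must be one of %s" % (OUT_TYPES,))
    # Per the help text, --minReadScore applies only to unrolled/subreads.
    if min_read_score is not None and read_type == "ccs":
        raise ValueError("--minReadScore is not valid with --readType ccs")
    # Per the help text, --minPasses applies only to ccs.
    if min_passes is not None and read_type != "ccs":
        raise ValueError("--minPasses is valid only with --readType ccs")

    cmd = ["bash5tools.py", input_file,
           "--outFilePrefix", prefix,
           "--readType", read_type,
           "--outType", out_type]
    if min_length is not None:
        cmd += ["--minLength", str(min_length)]
    if min_read_score is not None:
        cmd += ["--minReadScore", str(min_read_score)]
    if min_passes is not None:
        cmd += ["--minPasses", str(min_passes)]
    return cmd


# CCS reads with a minimum length of 100, written as FASTQ.
print(" ".join(build_bash5tools_cmd("input.bas.h5", "myreads", "ccs",
                                    out_type="fastq", min_length=100)))
```

The returned list can be passed to subprocess.check_call to actually run the extraction; building a list rather than a single string avoids shell quoting problems with file names.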

Tool: cmph5tools.py

cmph5tools.py is a multi-command tool that provides access to the following subtools:

  1. merge: Merge multiple cmp.h5 files into a single file.
  2. sort: Sort a cmp.h5 file.
  3. select: Create a new file from a cmp.h5 file by specifying which reads to include.
  4. equal: Compare the contents of 2 cmp.h5 files for equivalence.
  5. summarize: Summarize the contents of a cmp.h5 file in a verbose, human readable format.
  6. stats: Extract summary metrics from a cmp.h5 file into a csv file.
  7. valid: Determine whether a cmp.h5 file is valid.
  8. listMetrics: Emit the available metrics and statistics for use in the select and stats subcommands.

To list all available subtools provided by cmph5tools.py, run:

cmph5tools.py --help

Each subtool has its own usage information, which can be displayed by running:

cmph5tools.py <toolname> --help

When running any subtool, the --info command-line argument is recommended, since it prints progress information to stdout while the script is running:

cmph5tools.py <toolname> --info <other arguments>
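
For scripted use, the invocation pattern above can be wrapped in a small function. This is a sketch under stated assumptions: build_cmph5tools_cmd is a hypothetical helper, not part of pbh5tools, and the subtool-specific arguments are passed through untouched because they are not documented here (use the per-subtool --help output to find them). The command is only assembled, not executed.

```python
# Hypothetical helper (not part of pbh5tools): builds a cmph5tools.py
# command following the pattern "cmph5tools.py <toolname> --info <args>".
SUBTOOLS = ("merge", "sort", "select", "equal",
            "summarize", "stats", "valid", "listMetrics")


def build_cmph5tools_cmd(subtool, extra_args=(), info=True):
    """Return the command as a list of strings; nothing is executed."""
    if subtool not in SUBTOOLS:
        raise ValueError("unknown subtool %r; expected one of %s"
                         % (subtool, SUBTOOLS))
    cmd = ["cmph5tools.py", subtool]
    if info:
        # --info asks the subtool to report progress while it runs.
        cmd.append("--info")
    # Subtool-specific arguments are opaque to this helper.
    cmd.extend(extra_args)
    return cmd


# Show the usage information for the sort subtool.
print(" ".join(build_cmph5tools_cmd("sort", ["--help"], info=False)))
```

Checking the subtool name up front gives a clear error in a pipeline script before any process is started.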