cmph5tools.py Query Examples

This interface produces both data tables and new cmp.h5 files, and is meant to resemble SQL. At the heart of the tools is a small query language for extracting alignments and computing statistics over them. The query language has three clauses: what (the metrics or statistics to report), where (a filter on alignments), and groupBy (the metric used to partition alignments into groups).

Example 1: Produce a sub-sampled cmp.h5 file

Take 50% of the reads:

$ cmph5tools.py select --where "SubSample(rate=.5)" \
> --outFile ss.cmp.h5 aligned_reads.cmp.h5
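
To illustrate what a sub-sampling filter of this kind does, here is a minimal Python sketch. It is not the tool's implementation: the function name subsample_mask, the seed argument, and the choice to draw each alignment independently when a rate is given are all assumptions made for illustration. It only shows the idea of a boolean vector with one entry per alignment, where True means the alignment is kept.

```python
import random

def subsample_mask(n_alignments, rate=None, n=None, seed=None):
    """Hypothetical stand-in for a SubSample-style filter (not cmph5tools code).

    Returns one boolean per alignment; True means "keep this alignment".
    """
    rng = random.Random(seed)
    if n is not None:
        # Assumed behaviour: keep exactly n alignments, or all if fewer exist.
        keep = set(rng.sample(range(n_alignments), min(n, n_alignments)))
        return [i in keep for i in range(n_alignments)]
    if rate is None:
        raise ValueError("give either rate or n")
    # Assumed behaviour: keep each alignment independently with probability rate.
    return [rng.random() < rate for _ in range(n_alignments)]

mask = subsample_mask(10, rate=0.5, seed=0)
print(sum(mask), "of", len(mask), "alignments kept")
```

With independent draws the kept fraction is only approximately equal to the rate; asking for a fixed n instead gives an exact count.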

Example 2: Produce per-Barcode cmp.h5 files

Filter by AverageBarcodeScore and group by Barcode:

$ cmph5tools.py select --where "AverageBarcodeScore >= 30" \
> --groupBy Barcode aligned_reads.cmp.h5

Example 3: Produce tabular data from a cmp.h5 file

Grouped statistics (90th percentile of ReadLength and median Accuracy per barcode; tail keeps only the last ten rows):

$ cmph5tools.py stats --what "Tbl(q = Percentile(ReadLength, 90), m = Median(Accuracy))" \
> --groupBy Barcode aligned_reads.cmp.h5 | tail
bc_88--bc_88      486.40          0.91
bc_89--bc_89      561.00          0.91
bc_9--bc_9        479.80          0.90
bc_90--bc_90      563.60          0.89
bc_91--bc_91      554.60          0.91
bc_92--bc_92      523.00          0.90
bc_93--bc_93      542.00          0.90
bc_94--bc_94      518.00          0.90
bc_95--bc_95      512.20          0.91
bc_96--bc_96      609.60          0.92
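
The grouped query above can be pictured as "partition the alignments by barcode, then compute each statistic within each partition". The Python sketch below reproduces that shape on made-up rows. The data, the helper name percentile, and the use of linear interpolation between ranks (the convention numpy.percentile defaults to) are assumptions for illustration; the source does not say which percentile definition the tool uses.

```python
from collections import defaultdict
from statistics import median

def percentile(values, ptile):
    """Linearly interpolated percentile (assumed convention, see lead-in)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * ptile / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

# Made-up rows for illustration: (Barcode, ReadLength, Accuracy)
rows = [
    ("bc_1--bc_1", 400, 0.90),
    ("bc_1--bc_1", 500, 0.92),
    ("bc_1--bc_1", 600, 0.88),
    ("bc_2--bc_2", 300, 0.85),
    ("bc_2--bc_2", 700, 0.95),
]

# groupBy Barcode: collect the rows belonging to each barcode.
groups = defaultdict(list)
for barcode, read_length, accuracy in rows:
    groups[barcode].append((read_length, accuracy))

# what: q = Percentile(ReadLength, 90), m = Median(Accuracy), per group.
table = {}
for barcode, members in sorted(groups.items()):
    q = percentile([length for length, _ in members], 90)
    m = median([acc for _, acc in members])
    table[barcode] = (q, m)
    print(f"{barcode:<16}{q:>8.2f}{m:>14.2f}")
```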

Example 4: Query the package to determine the available metrics and statistics

Metrics and Statistics:

$ cmph5tools.py listMetrics
--- Metrics:
ByFactor[metric, factor, statistic]
_MoleculeReadStart
_MinSubreadLength
_MaxSubreadLength
_UnrolledReadLength
DefaultWhere
DefaultGroupBy
TemplateSpan
    The number of template bases covered by the read
ReadLength
NErrors
ReadDuration
FrameRate
IPD
PulseWidth
Movie
Reference
RefIdentifier
HoleNumber
ReadStart
ReadEnd
TemplateStart
TemplateEnd
MoleculeId
MoleculeName
Strand
AlignmentIdx
Barcode
AverageBarcodeScore
MapQV
WhiteList
SubSample[rate, n]
    boolean vector with true occurring at rate rate or nreads = n

--- Statistics:
Min
Max
Sum
Mean
Median
Count
Percentile[metric, ptile]
Round[metric, digits]

Example 5: Familiar SQL-like syntax

Filter by barcode and group by reference:

$ cmph5tools.py stats --what "Tbl(a=Accuracy,b=Barcode)" \
> --where "Barcode == 'bc_78--bc_78'"  \
> --groupBy Reference aligned_reads.cmp.h5
Group                a          b
MET_600_t2_2      0.96          bc_78--bc_78
MET_600_t2_2      0.82          bc_78--bc_78
MET_600_t2_2      0.85          bc_78--bc_78
MET_600_t2_2      0.89          bc_78--bc_78
MET_600_t2_2      0.87          bc_78--bc_78
MET_600_t2_2      0.90          bc_78--bc_78
MET_600_t2_2      0.90          bc_78--bc_78
MET_600_t2_2      0.94          bc_78--bc_78

Example 6: Familiar SQL-like functions

Count alignments:

$ cmph5tools.py stats --what "Count(Reference)" \
> --where "Barcode == 'bc_78--bc_78'" \
> --groupBy Reference aligned_reads.cmp.h5
Group             Count(Reference)
MET_600_t2_2                     8
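
For readers who think in SQL, the query above corresponds roughly to SELECT COUNT(*) ... WHERE Barcode = ... GROUP BY Reference. The Python sketch below shows the same filter, group, and count steps on made-up alignments. The reference and barcode names and the function count_by_reference are hypothetical; this is not how cmph5tools.py is implemented.

```python
from collections import Counter

# Made-up alignments for illustration: (Reference, Barcode)
alignments = [
    ("ref_A", "bc_1--bc_1"),
    ("ref_A", "bc_1--bc_1"),
    ("ref_B", "bc_1--bc_1"),
    ("ref_A", "bc_2--bc_2"),
]

def count_by_reference(rows, barcode):
    """Filter rows by barcode, then count the survivors per reference."""
    # where: keep only alignments carrying the requested barcode.
    selected = [ref for ref, bc in rows if bc == barcode]
    # groupBy Reference + Count: tally the selected alignments per reference.
    return Counter(selected)

counts = count_by_reference(alignments, "bc_1--bc_1")
for reference, n in sorted(counts.items()):
    print(f"{reference:<16}{n:>18}")
```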