cmph5tools.py Query Examples
This interface is used to produce both data tables and new cmp.h5 files. It is meant to be somewhat similar to SQL.
At the heart of the new tools is a small query language for extracting alignments and computing statistics over those alignments.
The three relevant clauses are: what, where, and groupBy.
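As a rough mental model only (this is not how cmph5tools.py is implemented), the three clauses can be read as filter, bucket, summarize. The sketch below uses made-up records and a hypothetical `query` helper; the field names simply mirror metric names that appear later on this page.

```python
from collections import defaultdict
from statistics import median

# Hypothetical alignment records; the values are invented for illustration.
alignments = [
    {"Barcode": "bc_1--bc_1", "ReadLength": 500, "Accuracy": 0.90},
    {"Barcode": "bc_1--bc_1", "ReadLength": 700, "Accuracy": 0.86},
    {"Barcode": "bc_2--bc_2", "ReadLength": 450, "Accuracy": 0.92},
    {"Barcode": "bc_2--bc_2", "ReadLength": 300, "Accuracy": 0.70},
]

def query(rows, what, where=lambda r: True, group_by=lambda r: "all"):
    """Filter rows (where), bucket them (groupBy), then summarize each bucket (what)."""
    groups = defaultdict(list)
    for row in rows:
        if where(row):
            groups[group_by(row)].append(row)
    return {key: what(members) for key, members in groups.items()}

# Median accuracy per barcode, keeping only reads of at least 400 bases.
result = query(
    alignments,
    what=lambda rs: median(r["Accuracy"] for r in rs),
    where=lambda r: r["ReadLength"] >= 400,
    group_by=lambda r: r["Barcode"],
)
print(result)
```

Omitting `where` keeps every row, and omitting `group_by` puts all rows in a single group, which is the assumption this sketch makes about the defaults.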
Example 1: Produce a sub-sampled cmp.h5 file
Take 50% of the reads:
$ cmph5tools.py select --where "SubSample(rate=.5)" \
> --outFile ss.cmp.h5 aligned_reads.cmp.h5
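The listMetrics output in Example 4 describes SubSample as a boolean vector with true occurring at the given rate. A minimal Python sketch of that idea follows; the function `subsample_mask` and its `seed` parameter are hypothetical, and the tool's actual sampling code may differ.

```python
import random

def subsample_mask(n_reads, rate, seed=None):
    """Return a list of booleans, each True independently with probability `rate`."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in range(n_reads)]

# Keep roughly half of 10,000 hypothetical reads.
mask = subsample_mask(n_reads=10_000, rate=0.5, seed=0)
kept = sum(mask)
print(f"kept {kept} of {len(mask)} reads")
```

Under this assumption each read is kept independently, so the kept fraction is close to the rate but not exactly equal to it.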
Example 2: Produce per-Barcode cmp.h5 files
Filter by AverageBarcodeScore:
$ cmph5tools.py select --where "AverageBarcodeScore >= 30" \
> --groupBy Barcode aligned_reads.cmp.h5
Example 3: Produce tabular data from a cmp.h5 file
Grouped Statistics:
$ cmph5tools.py stats --what "Tbl(q = Percentile(ReadLength, 90), m = Median(Accuracy))" \
> --groupBy Barcode aligned_reads.cmp.h5 | tail
bc_88--bc_88 486.40 0.91
bc_89--bc_89 561.00 0.91
bc_9--bc_9 479.80 0.90
bc_90--bc_90 563.60 0.89
bc_91--bc_91 554.60 0.91
bc_92--bc_92 523.00 0.90
bc_93--bc_93 542.00 0.90
bc_94--bc_94 518.00 0.90
bc_95--bc_95 512.20 0.91
bc_96--bc_96 609.60 0.92
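The table above is whitespace-delimited: group name, then the q and m columns. If the output needs further processing, it could be parsed along these lines. This is a sketch that assumes exactly three columns per row; `parse_stats_rows` is a hypothetical helper, not part of cmph5tools.py.

```python
def parse_stats_rows(text):
    """Parse 'group q m' rows; skip headers and anything that is not numeric."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        try:
            rows.append((fields[0], float(fields[1]), float(fields[2])))
        except ValueError:
            continue  # e.g. a header line such as "Group q m"
    return rows

# A few rows copied from the output shown above.
sample = """\
bc_88--bc_88 486.40 0.91
bc_89--bc_89 561.00 0.91
bc_9--bc_9 479.80 0.90
"""

rows = parse_stats_rows(sample)
for barcode, q90_length, median_accuracy in rows:
    print(barcode, q90_length, median_accuracy)
```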
Example 4: Query the package to determine the available metrics and statistics
Metrics and Statistics:
$ cmph5tools.py listMetrics
--- Metrics:
ByFactor[metric, factor, statistic]
_MoleculeReadStart
_MinSubreadLength
_MaxSubreadLength
_UnrolledReadLength
DefaultWhere
DefaultGroupBy
TemplateSpan
The number of template bases covered by the read
ReadLength
NErrors
ReadDuration
FrameRate
IPD
PulseWidth
Movie
Reference
RefIdentifier
HoleNumber
ReadStart
ReadEnd
TemplateStart
TemplateEnd
MoleculeId
MoleculeName
Strand
AlignmentIdx
Barcode
AverageBarcodeScore
MapQV
WhiteList
SubSample[rate, n]
boolean vector with true occuring at rate rate or nreads = n
--- Statistics:
Min
Max
Sum
Mean
Median
Count
Percentile[metric, ptile]
Round[metric, digits]
Example 5: Familiar SQL-like syntax
Filter by barcode and group by reference:
$ cmph5tools.py stats --what "Tbl(a=Accuracy,b=Barcode)" \
> --where "Barcode == 'bc_78--bc_78'" \
> --groupBy Reference aligned_reads.cmp.h5
Group a b
MET_600_t2_2 0.96 bc_78--bc_78
MET_600_t2_2 0.82 bc_78--bc_78
MET_600_t2_2 0.85 bc_78--bc_78
MET_600_t2_2 0.89 bc_78--bc_78
MET_600_t2_2 0.87 bc_78--bc_78
MET_600_t2_2 0.90 bc_78--bc_78
MET_600_t2_2 0.90 bc_78--bc_78
MET_600_t2_2 0.94 bc_78--bc_78
Example 6: Familiar SQL-like functions
Count alignments:
$ cmph5tools.py stats --what "Count(Reference)" \
> --where "Barcode == 'bc_78--bc_78'" \
> --groupBy Reference aligned_reads.cmp.h5
Group Count(Reference)
MET_600_t2_2 8
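The count of 8 agrees with the eight rows that Example 5 printed for the same barcode and reference. As a cross-check, the rows from Example 5 can be tallied by their Group column in plain Python; this is only an illustration of what Count does per group, not the tool's code.

```python
from collections import Counter

# Rows copied from the Example 5 output (header line omitted).
example5_rows = """\
MET_600_t2_2 0.96 bc_78--bc_78
MET_600_t2_2 0.82 bc_78--bc_78
MET_600_t2_2 0.85 bc_78--bc_78
MET_600_t2_2 0.89 bc_78--bc_78
MET_600_t2_2 0.87 bc_78--bc_78
MET_600_t2_2 0.90 bc_78--bc_78
MET_600_t2_2 0.90 bc_78--bc_78
MET_600_t2_2 0.94 bc_78--bc_78
"""

# Count rows per group, where the group is the first whitespace-delimited column.
counts = Counter(
    line.split()[0] for line in example5_rows.splitlines() if line.strip()
)
for group, n in counts.items():
    print(group, n)
```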