############################
cmph5tools.py Query Examples
############################

The ``cmph5tools.py`` interface produces both data tables and new cmp.h5 files, and is meant to resemble SQL.
At the heart of the tools is a small query language for extracting alignments and computing statistics over them.
The three relevant clauses are ``what``, ``where``, and ``groupBy``.
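
As a rough mental model only, the three clauses can be read as "filter, then bucket, then summarize". The sketch below is plain Python over made-up in-memory records, not the tool's implementation; the ``query`` helper and the sample values are hypothetical, and only the field names are borrowed from the metrics used in the examples.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical alignment records; the field names mirror metrics used in
# this document, but the values are invented for illustration.
alignments = [
    {"Reference": "ref_A", "Barcode": "bc_1--bc_1", "ReadLength": 500},
    {"Reference": "ref_A", "Barcode": "bc_2--bc_2", "ReadLength": 300},
    {"Reference": "ref_B", "Barcode": "bc_1--bc_1", "ReadLength": 700},
    {"Reference": "ref_B", "Barcode": "bc_1--bc_1", "ReadLength": 900},
]


def query(rows, what, where=lambda row: True, group_by=lambda row: "all"):
    """Filter rows (where), bucket them (groupBy), then summarize (what)."""
    groups = defaultdict(list)
    for row in rows:
        if where(row):
            groups[group_by(row)].append(row)
    return {key: what(members) for key, members in groups.items()}


# Mean read length per reference, restricted to a single barcode.
result = query(
    alignments,
    what=lambda rows: mean(r["ReadLength"] for r in rows),
    where=lambda r: r["Barcode"] == "bc_1--bc_1",
    group_by=lambda r: r["Reference"],
)
print(result)
```

Omitting ``where`` keeps every record and omitting ``group_by`` puts everything in one group, which is one plausible reading of the ``DefaultWhere`` and ``DefaultGroupBy`` entries in the metrics list.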

--------------------------------------------
Example 1: Produce a sub-sampled cmp.h5 file
--------------------------------------------

Take 50% of the reads: ::

    $ cmph5tools.py select --where "SubSample(rate=.5)" \
    > --outFile ss.cmp.h5 aligned_reads.cmp.h5
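
The ``listMetrics`` output describes ``SubSample`` as a boolean vector that is true at the given rate. A minimal sketch of that idea follows, assuming each alignment is kept independently with probability ``rate``; whether the tool samples this way or selects an exact count is not stated here, and the ``subsample_mask`` helper and its ``seed`` parameter are hypothetical.

```python
import random


def subsample_mask(n_alignments, rate, seed=None):
    """Return one boolean per alignment; each is True with probability `rate`."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    return [rng.random() < rate for _ in range(n_alignments)]


# Build a mask for ten alignments and list the indices that would be kept.
mask = subsample_mask(n_alignments=10, rate=0.5, seed=0)
kept = [index for index, keep in enumerate(mask) if keep]
print(len(mask), kept)
```

Under this assumption the number of kept alignments varies from run to run around ``rate * n_alignments`` rather than matching it exactly.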
    
-------------------------------------------
Example 2: Produce per-Barcode cmp.h5 files
-------------------------------------------

Keep alignments with an ``AverageBarcodeScore`` of at least 30, grouped by barcode: ::

    $ cmph5tools.py select --where "AverageBarcodeScore >= 30" \
    > --groupBy Barcode aligned_reads.cmp.h5
   
--------------------------------------------------
Example 3: Produce tabular data from a cmp.h5 file
--------------------------------------------------

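Compute the 90th percentile of ``ReadLength`` (``q``) and the median ``Accuracy`` (``m``) for each barcode, showing the last ten rows: ::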

    $ cmph5tools.py stats --what "Tbl(q = Percentile(ReadLength, 90), m = Median(Accuracy))" \
    > --groupBy Barcode aligned_reads.cmp.h5 | tail
    bc_88--bc_88      486.40          0.91
    bc_89--bc_89      561.00          0.91
    bc_9--bc_9        479.80          0.90 
    bc_90--bc_90      563.60          0.89
    bc_91--bc_91      554.60          0.91
    bc_92--bc_92      523.00          0.90
    bc_93--bc_93      542.00          0.90
    bc_94--bc_94      518.00          0.90
    bc_95--bc_95      512.20          0.91
    bc_96--bc_96      609.60          0.92
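
To make the two columns concrete, the sketch below computes a 90th percentile and a median per group in plain Python. It assumes linear interpolation between closest ranks for the percentile, which is a common convention but not confirmed to be the one the tool uses; the ``percentile`` helper and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def percentile(values, ptile):
    """Percentile using linear interpolation between closest ranks (assumed)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence is undefined")
    rank = (len(ordered) - 1) * ptile / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up (barcode, read length, accuracy) rows for illustration.
rows = [
    ("bc_1--bc_1", 400, 0.90),
    ("bc_1--bc_1", 500, 0.92),
    ("bc_1--bc_1", 600, 0.88),
    ("bc_2--bc_2", 300, 0.85),
    ("bc_2--bc_2", 700, 0.95),
]

# Bucket the rows by barcode, mirroring "--groupBy Barcode".
groups = defaultdict(list)
for barcode, read_length, accuracy in rows:
    groups[barcode].append((read_length, accuracy))

# One (q, m) pair per barcode, mirroring the two Tbl columns.
table = {}
for barcode, members in groups.items():
    lengths = [length for length, _ in members]
    accuracies = [acc for _, acc in members]
    table[barcode] = (percentile(lengths, 90), median(accuracies))

for barcode, (q, m) in sorted(table.items()):
    print(f"{barcode:<14}{q:>10.2f}{m:>14.2f}")
```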

------------------------------------------------------------------------------
Example 4: Query the package to determine the available metrics and statistics
------------------------------------------------------------------------------

List the available metrics and statistics: ::

    $ cmph5tools.py listMetrics
    --- Metrics:
    ByFactor[metric, factor, statistic]	
    _MoleculeReadStart	
    _MinSubreadLength	
    _MaxSubreadLength	
    _UnrolledReadLength	
    DefaultWhere	
    DefaultGroupBy	
    TemplateSpan
	The number of template bases covered by the read	
    ReadLength	
    NErrors	
    ReadDuration	
    FrameRate	
    IPD	
    PulseWidth	
    Movie	
    Reference	
    RefIdentifier	
    HoleNumber	
    ReadStart	 
    ReadEnd	
    TemplateStart	
    TemplateEnd	
    MoleculeId	
    MoleculeName	
    Strand	
    AlignmentIdx	
    Barcode	
    AverageBarcodeScore	
    MapQV	
    WhiteList	
    SubSample[rate, n]
       	boolean vector with true occuring at rate rate or nreads = n
 
    --- Statistics:
    Min	
    Max	
    Sum	
    Mean	
    Median	
    Count	
    Percentile[metric, ptile]	
    Round[metric, digits]

-----------------------------------
Example 5: Familiar SQL-like syntax
-----------------------------------

Filter by barcode and group by reference: ::

    $ cmph5tools.py stats --what "Tbl(a=Accuracy,b=Barcode)" \
    > --where "Barcode == 'bc_78--bc_78'"  \
    > --groupBy Reference aligned_reads.cmp.h5
    Group                a          b
    MET_600_t2_2      0.96          bc_78--bc_78          
    MET_600_t2_2      0.82          bc_78--bc_78          
    MET_600_t2_2      0.85          bc_78--bc_78          
    MET_600_t2_2      0.89          bc_78--bc_78          
    MET_600_t2_2      0.87          bc_78--bc_78          
    MET_600_t2_2      0.90          bc_78--bc_78          
    MET_600_t2_2      0.90          bc_78--bc_78          
    MET_600_t2_2      0.94          bc_78--bc_78

--------------------------------------
Example 6: Familiar SQL-like functions
--------------------------------------

Count the alignments for one barcode, grouped by reference: ::

    $ cmph5tools.py stats --what "Count(Reference)" \
    > --where "Barcode == 'bc_78--bc_78'" \
    > --groupBy Reference aligned_reads.cmp.h5
    Group             Count(Reference)
    MET_600_t2_2                     8