pbcore.io¶
The pbcore.io package provides a number of lightweight interfaces to PacBio data files and other standard bioinformatics file formats.
Preferred usage is to import classes directly from the pbcore.io package.
The classes within pbcore.io adhere to a few conventions, in order to provide a uniform API:
Each data file type is thought of as a container of a Record type; all Reader classes support streaming access by iterating on the reader object, and IndexedBamReader additionally provides random access to alignments/reads.
For example:
    from pbcore.io import *
    with IndexedBamReader(filename) as f:
        for r in f:
            process(r)
To make scripts a bit more user friendly, a progress bar can be easily added using the tqdm third-party package:
    from pbcore.io import *
    from tqdm import tqdm
    with IndexedBamReader(filename) as f:
        for r in tqdm(f):
            process(r)
The constructor argument needed to instantiate Reader and Writer objects can be either a filename (which can be suffixed by “.gz” for all file types) or an open file handle. The reader/writer classes will do what you would expect.
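The filename-or-handle convention described above can be illustrated with a small standalone helper. This is only a sketch of the idea using the standard library: open_text_source is a hypothetical name, and pbcore's own handling may differ in its details.

```python
import gzip
import io
import os
import tempfile


def open_text_source(source):
    """Return a text-mode handle for a filename (optionally ".gz") or an open handle.

    Hypothetical helper for illustration only; not part of pbcore.
    """
    if isinstance(source, str):
        if source.endswith(".gz"):
            # Transparently decompress gzip-suffixed files.
            return gzip.open(source, "rt")
        return open(source, "r")
    # Anything that is not a string is assumed to be an open file handle.
    return source


# Demonstration with a temporary gzip-compressed file and an in-memory handle.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.fasta.gz")
    with gzip.open(path, "wt") as out:
        out.write(">dog\nGATTACA\n")
    with open_text_source(path) as handle:
        from_gzip = handle.read()

from_handle = open_text_source(io.StringIO(">cat\nCATTACA\n")).read()
```

Both calls yield the same kind of readable object, which is what lets calling code stay indifferent to how the data was supplied.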
BAM format¶
The BAM format is a standard format for describing aligned and unaligned
reads. PacBio uses the BAM format exclusively.
For basic functionality, use BamReader; for full index operation support, use the IndexedBamReader API, which requires the auxiliary PacBio BAM index file (bam.pbi file).
- class pbcore.io.BamAlignment(bamReader, pysamAlignedRead, rowNumber=None)¶
- DeletionQV(aligned=True, orientation='native')¶
- DeletionTag(aligned=True, orientation='native')¶
- property HoleNumber¶
- IPD(aligned=True, orientation='native')¶
- InsertionQV(aligned=True, orientation='native')¶
- property MapQV¶
- MergeQV(aligned=True, orientation='native')¶
- PulseWidth(aligned=True, orientation='native')¶
- SubstitutionQV(aligned=True, orientation='native')¶
- property barcode¶
- property barcodeName¶
- baseFeature(featureName, aligned=True, orientation='native')¶
Retrieve the base feature as indicated.
- aligned: whether gaps should be inserted to reflect the alignment
- orientation: “native” or “genomic”
Note that this function assumes the feature is stored in native orientation in the file, so it is not appropriate to use this method to fetch the read or the qual, which are oriented genomically in the file.
- clippedTo(refStart, refEnd)¶
Return a new BamAlignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.
Warning
This function takes time linear in the length of the alignment.
- property hasPbi¶
- property hqRegionSnr¶
Return the per-channel SNR averaged over the HQ region.
Note
This capability was not available in cmp.h5 files, so use of this property can result in code that won’t work on legacy data.
- property identity¶
- property isCCS¶
- property isForwardStrand¶
- property isMapped¶
- property isReverseStrand¶
- property isTranscript¶
- property isUnmapped¶
- property mapQV¶
- property movieName¶
- property numPasses¶
- property qEnd¶
- property qId¶
- property qLen¶
- property qName¶
- property qStart¶
- property queryEnd¶
- property queryName¶
- property queryStart¶
- read(aligned=True, orientation='native')¶
- property readGroupInfo¶
- readPositions(aligned=True, orientation='native')¶
Returns an array of read positions.
If aligned is True, the array has the same length as the alignment and readPositions[i] = read position of the i’th column in the oriented alignment.
If aligned is False, the array has the same length as the mapped reference segment and readPositions[i] = read position of the i’th base in the oriented reference segment.
- property readScore¶
Return the “read score”, a de novo prediction (not using any alignment) of the accuracy (between 0 and 1) of this read.
Note
This capability was not available in cmp.h5 files, so use of this property can result in code that won’t work on legacy data.
- property readType¶
- property reader¶
- reference(aligned=True, orientation='native')¶
- property referenceId¶
- property referenceInfo¶
- property referenceName¶
- referencePositions(aligned=True, orientation='native')¶
Returns an array of reference positions.
If aligned is True, the array has the same length as the alignment and referencePositions[i] = reference position of the i’th column in the oriented alignment.
If aligned is False, the array has the same length as the read and referencePositions[i] = reference position of the i’th base in the oriented read.
- property scrapType¶
- property sequencingChemistry¶
- property tId¶
- transcript(orientation='native', style='gusfield')¶
A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.
- unrolledCigar(orientation='native')¶
Run-length decode the CIGAR encoding, and orient. Clipping ops are removed.
- property zScore¶
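The run-length decoding that unrolledCigar describes can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: unroll_cigar is a hypothetical name, ops are returned as one-letter strings rather than whatever encoding pbcore uses, and orientation is modeled as a simple reversal flag.

```python
import re

# One CIGAR element: a run length followed by a single op letter (SAM spec ops).
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")
CLIPPING_OPS = {"S", "H"}  # soft and hard clipping


def unroll_cigar(cigar, reverse=False):
    """Run-length decode a CIGAR string into a list of one-letter ops.

    Clipping ops are dropped. With reverse=True the op order is flipped,
    which is how a change of strand orientation would be reflected.
    Hypothetical helper for illustration only.
    """
    ops = []
    for count, op in CIGAR_ELEMENT.findall(cigar):
        if op in CLIPPING_OPS:
            continue
        ops.extend(op * int(count))
    return ops[::-1] if reverse else ops


forward_ops = unroll_cigar("2S3=1I2=1D1X")
reverse_ops = unroll_cigar("2S3=1I2=1D1X", reverse=True)
```

After unrolling, each list element corresponds to one alignment column, which is the form that position arrays such as those from readPositions and referencePositions are indexed against when aligned is True.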
- class pbcore.io.BamReader(fname, referenceFastaFname=None)¶
Reader for a BAM with a bam.bai (SAMtools) index, but not a bam.pbi (PacBio) index. Supports basic BAM operations.
- property index¶
- readsInRange(winId, winStart, winEnd, justIndices=False)¶
- class pbcore.io.IndexedBamReader(fname, referenceFastaFname=None, sharedIndex=None)¶
An IndexedBamReader is a BAM reader class that uses the bam.pbi (PacBio BAM index) file to enable random access by “row number” and to provide access to precomputed semantic information about the BAM records.
- atRowNumber(rn)¶
- property identity¶
Fractional alignment sequence identities as numpy array.
- property index¶
- readsInRange(winId, winStart, winEnd, justIndices=False)¶
FASTA Format¶
FASTA is a standard format for sequence data. We recommend using the FastaTable class, which provides random access to indexed FASTA files (using the conventional SAMtools “fai” index).
- pbcore.io.FastaTable¶
alias of IndexedFastaReader
- class pbcore.io.FastaRecord(header, sequence)¶
A FastaRecord object models a named sequence in a FASTA file.
- property comment¶
The comment associated with the sequence in the FASTA file, equal to the contents of the FASTA header following the first whitespace
- classmethod fromString(s)¶
Interprets a string as a FASTA record. Does not make any assumptions about wrapping of the sequence string.
- property header¶
The header of the sequence in the FASTA file, equal to the entire first line of the FASTA record following the ‘>’ character.
Warning
You should almost certainly be using “id”, not “header”.
- property id¶
The id of the sequence in the FASTA file, equal to the FASTA header up to the first whitespace.
- property length¶
Get the length of the FASTA sequence
- property name¶
DEPRECATED: The name of the sequence in the FASTA file, equal to the entire FASTA header following the ‘>’ character
- reverseComplement(preserveHeader=False)¶
Return a new FastaRecord with the reverse-complemented DNA sequence. Optionally, set preserveHeader=True to keep the original header on the new record.
- property sequence¶
The sequence for the record as present in the FASTA file. (Newlines are removed but otherwise no sequence normalization is performed).
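The relationship between header, id, and comment, and the effect of reverse complementation, can be sketched with two small standalone functions. These are illustrative only: split_header and reverse_complement are hypothetical names, returning None for a missing comment is an assumption, and only unambiguous DNA bases are handled.

```python
# Translation table for unambiguous DNA bases, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def split_header(header):
    """Split a FASTA header into (id, comment) at the first whitespace.

    The comment is None when the header has no whitespace (an assumption
    made for this sketch).
    """
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else None
    return seq_id, comment


def reverse_complement(sequence):
    """Complement each base, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


seq_id, comment = split_header("ref000001|EGFR_Exon_2 example comment text")
rc = reverse_complement("GATTACA")
```

Because id stops at the first whitespace, it is the stable key to use when matching records against other files, which is why the header property carries a warning.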
- class pbcore.io.FastaReader(f)¶
Streaming reader for FASTA files, useable as a one-shot iterator over FastaRecord objects. Agnostic about line wrapping.
Example:
    >>> from pbcore.io import FastaReader
    >>> from pbcore import data
    >>> filename = data.getTinyFasta()
    >>> r = FastaReader(filename)
    >>> for record in r:
    ...     print((record.header, len(record.sequence)))
    ('ref000001|EGFR_Exon_2', 183)
    ('ref000002|EGFR_Exon_3', 203)
    ('ref000003|EGFR_Exon_4', 215)
    ('ref000004|EGFR_Exon_5', 157)
    >>> r.close()
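To show what being agnostic about line wrapping means in practice, here is a minimal streaming parser written against the standard library only. It is a sketch, not FastaReader itself: iter_fasta is a hypothetical name and it yields plain (header, sequence) tuples instead of FastaRecord objects.

```python
import io


def iter_fasta(handle):
    """Yield (header, sequence) tuples from a text handle.

    Sequence lines are concatenated until the next '>' line, so the
    result does not depend on how the sequence was wrapped.
    """
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    # Emit the final record, which has no following '>' line.
    if header is not None:
        yield header, "".join(chunks)


wrapped = io.StringIO(">seq1 wrapped\nGATT\nACA\n>seq2\nCATTACA\n")
records = list(iter_fasta(wrapped))
```

Like the reader documented above, the generator is one-shot: once it has been consumed, the input must be reopened to iterate again.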
- class pbcore.io.FastaWriter(f)¶
A FASTA file writer class
Example:
    >>> from pbcore.io import FastaWriter
    >>> with FastaWriter("output.fasta.gz") as writer:
    ...     writer.writeRecord("dog", "GATTACA")
    ...     writer.writeRecord("cat", "CATTACA")
(Notice that the underlying file will be automatically closed after exit from the with block.)
- writeRecord(*args)¶
Write a FASTA record to the file. If given one argument, it is interpreted as a FastaRecord. Given two arguments, they are interpreted as the name and the sequence.
FASTQ Format¶
FASTQ is a standard format for sequence data with associated quality scores.
- class pbcore.io.FastqRecord(header, sequence, quality=None, qualityString=None)¶
A FastqRecord object models a named sequence and its quality values in a FASTQ file. For reference consult Wikipedia’s FASTQ entry. We adopt the Sanger encoding convention, allowing the encoding of QV values in [0, 93] using ASCII 33 to 126. We only support FASTQ files in the four-line convention (unwrapped). Wrapped FASTQ files are generally considered a bad idea as the @, + delimiters can also appear in the quality string, thus parsing cannot be done safely.
- property comment¶
The comment associated with the sequence in the FASTQ file, equal to the contents of the FASTQ header following the first whitespace
- classmethod fromString(s)¶
Interprets a string as a FASTQ record. Only supports four-line format, as wrapped FASTQs can’t easily be safely parsed.
- property header¶
The header of the sequence in the FASTQ file
- property id¶
The id of the sequence in the FASTQ file, equal to the FASTQ header up to the first whitespace.
- property length¶
The length of the sequence
- property name¶
DEPRECATED: The name of the sequence in the FASTQ file
- property quality¶
The quality values, as an array of integers
- property qualityString¶
The quality values as an ASCII-encoded string
- reverseComplement(preserveHeader=False)¶
Return a new FastqRecord with the reverse-complemented DNA sequence. Optionally, set preserveHeader=True to keep the original header on the new record.
- property sequence¶
The sequence for the record as present in the FASTQ file.
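The Sanger convention mentioned above maps each QV in [0, 93] to the ASCII character with code QV + 33. The following standalone sketch shows the conversion in both directions; the function names are hypothetical and this is not pbcore's own code.

```python
PHRED_OFFSET = 33  # Sanger convention: QV 0 maps to ASCII 33 ('!')
MAX_QV = 93        # QV 93 maps to ASCII 126 ('~')


def encode_qualities(qvs):
    """Convert a sequence of integer QVs into a Sanger-encoded string."""
    for qv in qvs:
        if not 0 <= qv <= MAX_QV:
            raise ValueError("QV out of range [0, 93]: %r" % (qv,))
    return "".join(chr(qv + PHRED_OFFSET) for qv in qvs)


def decode_quality_string(quality_string):
    """Convert a Sanger-encoded string back into a list of integer QVs."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


encoded = encode_qualities([35] * 7)
decoded = decode_quality_string("!~")
```

The two functions correspond to the two views a record exposes: quality holds the integers and qualityString holds the encoded text.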
- class pbcore.io.FastqReader(f)¶
Reader for FASTQ files, useable as a one-shot iterator over FastqRecord objects. FASTQ files must follow the four-line convention.
- class pbcore.io.FastqWriter(f)¶
A FASTQ file writer class
Example:
    >>> from pbcore.io import FastqWriter
    >>> with FastqWriter("output.fq.gz") as writer:
    ...     writer.writeRecord("dog", "GATTACA", [35]*7)
    ...     writer.writeRecord("cat", "CATTACA", [35]*7)
(Notice that the underlying file will be automatically closed after exit from the with block.)
- writeRecord(*args)¶
Write a FASTQ record to the file. If given one argument, it is interpreted as a FastqRecord. Given three arguments, they are interpreted as the name, sequence, and quality.
GFF Format (Version 3)¶
The GFF format is an open and flexible standard for representing genomic features.
- class pbcore.io.Gff3Record(seqid, start, end, type, score='.', strand='.', phase='.', source='.', attributes=())¶
Class for GFF record, providing uniform access to standard GFF fields and attributes.
    >>> from pbcore.io import Gff3Record
    >>> record = Gff3Record("chr1", 10, 11, "insertion",
    ...                     attributes=[("foo", "1"), ("bar", "2")])
    >>> record.start
    10
    >>> record.foo
    '1'
    >>> record.baz = 3
    >>> del record.baz
Attribute access using record.fieldName notation raises ValueError if an attribute named fieldName doesn’t exist. Use:
    >>> record.get(fieldName)
to fetch a field or attribute with None default or:
    >>> record.get(fieldName, defaultValue)
to fetch the field or attribute with a custom default.
- copy()¶
Return a shallow copy
- classmethod fromString(s)¶
Parse a string as a GFF record. Trailing whitespace is ignored.
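For readers unfamiliar with the layout that fromString has to handle, a GFF3 line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with attributes written as key=value pairs joined by semicolons. The sketch below parses one such line into a plain dictionary; parse_gff3_line is a hypothetical name, it does no URL-unescaping, and it is not how Gff3Record is implemented.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary (illustrative sketch).

    Trailing whitespace is ignored. Attributes are returned as an ordered
    list of (key, value) string pairs.
    """
    columns = line.rstrip().split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    seqid, source, type_, start, end, score, strand, phase, attr_text = columns
    attributes = []
    if attr_text != ".":
        for pair in attr_text.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes.append((key, value))
    return {
        "seqid": seqid,
        "source": source,
        "type": type_,
        "start": int(start),
        "end": int(end),
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


parsed = parse_gff3_line("chr1\t.\tinsertion\t10\t11\t.\t.\t.\tfoo=1;bar=2  \n")
```

Note that the column order in the file differs from the Gff3Record constructor's argument order, where source comes after phase.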
- class pbcore.io.GffReader(f)¶
A GFF file reader class
- class pbcore.io.GffWriter(f)¶
A GFF file writer class