pbcore.io

The pbcore.io package provides a number of lightweight interfaces to PacBio data files and other standard bioinformatics file formats. Preferred usage is to import classes directly from the pbcore.io package.

The classes within pbcore.io adhere to a few conventions, in order to provide a uniform API:

  • Each data file type is thought of as a container of a Record type; all Reader classes support streaming access by iterating on the reader object, and IndexedBamReader additionally provides random access to alignments/reads.

    For example:

    from pbcore.io import IndexedBamReader
    with IndexedBamReader(filename) as f:
        for r in f:
            process(r)
    

    To make scripts a bit more user friendly, a progress bar can be easily added using the tqdm third-party package:

    from pbcore.io import IndexedBamReader
    from tqdm import tqdm
    with IndexedBamReader(filename) as f:
        for r in tqdm(f):
            process(r)
    
  • The constructor argument needed to instantiate Reader and Writer objects can be either a filename (which can be suffixed by “.gz” for all file types) or an open file handle. The reader/writer classes will do what you would expect.
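The filename-or-handle convention can be pictured with a small standalone sketch. The helper name `open_source` is hypothetical and is not part of pbcore; it only illustrates how a constructor might treat a ".gz" filename, a plain filename, and an already-open handle.

```python
import gzip
import io


def open_source(f, mode="rt"):
    """Return a text file object for a filename, or pass an open handle through.

    Hypothetical helper for illustration; not pbcore's actual implementation.
    """
    if isinstance(f, str):
        if f.endswith(".gz"):
            # gzip.open in text mode compresses/decompresses transparently.
            return gzip.open(f, mode)
        return open(f, mode)
    # Anything that is not a string is assumed to be an open file-like object.
    return f


# An in-memory handle is returned unchanged.
handle = io.StringIO(">dog\nGATTACA\n")
assert open_source(handle) is handle
```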

BAM Format

The BAM format is a standard format for describing aligned and unaligned reads. PacBio uses the BAM format exclusively. For basic functionality, use BamReader; for full index operation support, use IndexedBamReader, which requires the auxiliary PacBio BAM index file (bam.pbi file).

class pbcore.io.BamAlignment(bamReader, pysamAlignedRead, rowNumber=None)
DeletionQV(aligned=True, orientation='native')
DeletionTag(aligned=True, orientation='native')
property HoleNumber
IPD(aligned=True, orientation='native')
InsertionQV(aligned=True, orientation='native')
property MapQV
MergeQV(aligned=True, orientation='native')
PulseWidth(aligned=True, orientation='native')
SubstitutionQV(aligned=True, orientation='native')
property barcode
property barcodeName
baseFeature(featureName, aligned=True, orientation='native')

Retrieve the base feature as indicated.

  • aligned: whether gaps should be inserted to reflect the alignment
  • orientation: “native” or “genomic”

Note that this function assumes the feature is stored in native orientation in the file, so it is not appropriate to use this method to fetch the read or the qual, which are oriented genomically in the file.

clippedTo(refStart, refEnd)

Return a new BamAlignment that refers to a subalignment of this alignment, as induced by clipping to reference coordinates refStart to refEnd.

Warning

This function takes time linear in the length of the alignment.

property hasPbi
property hqRegionSnr

Return the per-channel SNR averaged over the HQ region.

Note

This capability was not available in cmp.h5 files, so use of this property can result in code that won’t work on legacy data.

property identity
property isCCS
property isForwardStrand
property isMapped
property isReverseStrand
property isTranscript
property isUnmapped
property mapQV
property movieName
property numPasses
property qEnd
property qId
property qLen
property qName
property qStart
property queryEnd
property queryName
property queryStart
read(aligned=True, orientation='native')
property readGroupInfo
readPositions(aligned=True, orientation='native')

Returns an array of read positions.

If aligned is True, the array has the same length as the alignment and readPositions[i] = read position of the i’th column in the oriented alignment.

If aligned is False, the array has the same length as the mapped reference segment and readPositions[i] = read position of the i’th base in the oriented reference segment.

property readScore

Return the “read score”, a de novo prediction (not using any alignment) of the accuracy (between 0 and 1) of this read.

Note

This capability was not available in cmp.h5 files, so use of this property can result in code that won’t work on legacy data.

property readType
property reader
reference(aligned=True, orientation='native')
property referenceId
property referenceInfo
property referenceName
referencePositions(aligned=True, orientation='native')

Returns an array of reference positions.

If aligned is True, the array has the same length as the alignment and referencePositions[i] = reference position of the i’th column in the oriented alignment.

If aligned is False, the array has the same length as the read and referencePositions[i] = reference position of the i’th base in the oriented read.

property scrapType
property sequencingChemistry
property tId
transcript(orientation='native', style='gusfield')

A text representation of the alignment moves (see Gusfield). This can be useful in pretty-printing an alignment.
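As a rough illustration of a Gusfield-style transcript, the sketch below derives one from two gapped strings of equal length, using M (match), R (replacement), I (insertion) and D (deletion). The function name and the gapped-string input are assumptions made for this example; pbcore computes the transcript from the BAM record itself.

```python
def gusfield_transcript(aligned_read, aligned_ref, gap="-"):
    """Build an M/R/I/D transcript from two gapped, equal-length strings."""
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    moves = []
    for q, t in zip(aligned_read, aligned_ref):
        if t == gap:
            moves.append("I")  # base present only in the read
        elif q == gap:
            moves.append("D")  # base present only in the reference
        elif q == t:
            moves.append("M")  # match
        else:
            moves.append("R")  # replacement (mismatch)
    return "".join(moves)


print(gusfield_transcript("GAT-ACA", "GCTTA-A"))  # MRMDMIM
```

Printing the read, the transcript and the reference on three consecutive lines gives a simple pretty-printed alignment.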

unrolledCigar(orientation='native')

Run-length decode the CIGAR encoding, and orient. Clipping ops are removed.
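Run-length decoding can be sketched on a textual CIGAR string as follows. This is a standalone illustration: the function name, the string input and the string return value are assumptions chosen for readability, and the return type of unrolledCigar itself may differ.

```python
import re

# SAM CIGAR operation letters; S and H are the clipping operations.
_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def unroll_cigar(cigar, reverse=False):
    """Run-length decode a CIGAR string, dropping clipping ops (S, H)."""
    ops = "".join(
        op * int(count)
        for count, op in _CIGAR_TOKEN.findall(cigar)
        if op not in "SH"
    )
    # Reversing models re-orienting a reverse-strand alignment.
    return ops[::-1] if reverse else ops


print(unroll_cigar("2S3=1I2D1X"))  # ===IDDX
```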

property zScore
class pbcore.io.BamReader(fname, referenceFastaFname=None)

Reader for a BAM with a bam.bai (SAMtools) index, but not a bam.pbi (PacBio) index. Supports basic BAM operations.

property index
readsInRange(winId, winStart, winEnd, justIndices=False)
class pbcore.io.IndexedBamReader(fname, referenceFastaFname=None, sharedIndex=None)

An IndexedBamReader is a BAM reader class that uses the bam.pbi (PacBio BAM index) file to enable random access by “row number” and to provide access to precomputed semantic information about the BAM records.

atRowNumber(rn)
property identity

Fractional alignment sequence identities as numpy array.

property index
readsInRange(winId, winStart, winEnd, justIndices=False)

FASTA Format

FASTA is a standard format for sequence data. We recommend using the FastaTable class, which provides random access to indexed FASTA files (using the conventional SAMtools “fai” index).
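To show why a “fai” index enables random access, here is a minimal standalone sketch that reads the five standard index columns (name, length, byte offset, bases per line, bytes per line) and seeks directly to a subsequence. The function names `read_fai` and `fetch` are hypothetical; this is not how FastaTable is implemented, only the arithmetic the index makes possible.

```python
def read_fai(fai_path):
    """Parse a .fai index into {name: (length, offset, linebases, linewidth)}."""
    index = {}
    with open(fai_path) as fh:
        for line in fh:
            name, length, offset, linebases, linewidth = (
                line.rstrip("\n").split("\t")[:5]
            )
            index[name] = (int(length), int(offset),
                           int(linebases), int(linewidth))
    return index


def fetch(fasta_path, index, name, start, end):
    """Return bases [start, end) of sequence `name` (0-based, end-exclusive)."""
    length, offset, linebases, linewidth = index[name]
    if not 0 <= start <= end <= length:
        raise IndexError("range outside the sequence")
    if start == end:
        return ""

    def byte_pos(i):
        # Each full line occupies `linewidth` bytes but holds `linebases` bases.
        return offset + (i // linebases) * linewidth + (i % linebases)

    with open(fasta_path, "rb") as fh:
        fh.seek(byte_pos(start))
        raw = fh.read(byte_pos(end - 1) - byte_pos(start) + 1)
    # Drop the line terminators that fell inside the byte range.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode("ascii")
```

Only the requested bytes are read, so the cost of a lookup does not depend on the size of the FASTA file.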

pbcore.io.FastaTable

alias of IndexedFastaReader

class pbcore.io.FastaRecord(header, sequence)

A FastaRecord object models a named sequence in a FASTA file.

property comment

The comment associated with the sequence in the FASTA file, equal to the contents of the FASTA header following the first whitespace.

classmethod fromString(s)

Interprets a string as a FASTA record. Does not make any assumptions about wrapping of the sequence string.
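Being agnostic about wrapping amounts to joining every line after the header. The sketch below shows that idea in isolation; `parse_fasta_record` is a hypothetical name and returns a plain tuple rather than a FastaRecord.

```python
def parse_fasta_record(s):
    """Split one FASTA record into (header, sequence), ignoring line wrapping."""
    lines = s.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    header = lines[0][1:]
    # Joining the remaining lines makes the result independent of wrapping.
    sequence = "".join(line.strip() for line in lines[1:])
    return header, sequence


wrapped = ">dog a good boy\nGATT\nACA\n"
unwrapped = ">dog a good boy\nGATTACA\n"
assert parse_fasta_record(wrapped) == parse_fasta_record(unwrapped)
```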

property header

The header of the sequence in the FASTA file, equal to the entire first line of the FASTA record following the ‘>’ character.

Warning

You should almost certainly be using “id”, not “header”.

property id

The id of the sequence in the FASTA file, equal to the FASTA header up to the first whitespace.
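The relationship between header, id and comment can be sketched as a split at the first whitespace. The helper `split_header` is hypothetical, and returning an empty string when no comment is present is a choice made for this example; the comment property may behave differently in that case.

```python
def split_header(header):
    """Return (id, comment) by splitting the header at the first whitespace."""
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    comment = parts[1] if len(parts) > 1 else ""
    return seq_id, comment


print(split_header("ref000001|EGFR_Exon_2 Homo sapiens"))
# ('ref000001|EGFR_Exon_2', 'Homo sapiens')
```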

property length

Get the length of the FASTA sequence

property name

DEPRECATED: The name of the sequence in the FASTA file, equal to the entire FASTA header following the ‘>’ character.

reverseComplement(preserveHeader=False)

Return a new FastaRecord with the reverse-complemented DNA sequence. Set preserveHeader=True to keep the original header on the new record.
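For reference, reverse complementation of a plain DNA string can be sketched as below. This standalone function handles only A, C, G, T and N (an assumption made to keep the example short) and says nothing about how pbcore treats other IUPAC codes.

```python
# Translation table mapping each base to its complement, preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement each base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GATTACA"))  # TGTAATC
```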

property sequence

The sequence for the record as present in the FASTA file. (Newlines are removed but otherwise no sequence normalization is performed).

class pbcore.io.FastaReader(f)

Streaming reader for FASTA files, useable as a one-shot iterator over FastaRecord objects. Agnostic about line wrapping.

Example:

>>> from pbcore.io import FastaReader
>>> from pbcore import data
>>> filename = data.getTinyFasta()
>>> r = FastaReader(filename)
>>> for record in r:
...     print((record.header, len(record.sequence)))
('ref000001|EGFR_Exon_2', 183)
('ref000002|EGFR_Exon_3', 203)
('ref000003|EGFR_Exon_4', 215)
('ref000004|EGFR_Exon_5', 157)
>>> r.close()
class pbcore.io.FastaWriter(f)

A FASTA file writer class

Example:

>>> from pbcore.io import FastaWriter
>>> with FastaWriter("output.fasta.gz") as writer:
...     writer.writeRecord("dog", "GATTACA")
...     writer.writeRecord("cat", "CATTACA")

(Notice that the underlying file will be automatically closed after exit from the with block.)

writeRecord(*args)

Write a FASTA record to the file. If given one argument, it is interpreted as a FastaRecord. Given two arguments, they are interpreted as the name and the sequence.

FASTQ Format

FASTQ is a standard format for sequence data with associated quality scores.

class pbcore.io.FastqRecord(header, sequence, quality=None, qualityString=None)

A FastqRecord object models a named sequence and its quality values in a FASTQ file. For reference consult Wikipedia’s FASTQ entry. We adopt the Sanger encoding convention, allowing the encoding of QV values in [0, 93] using ASCII 33 to 126. We only support FASTQ files in the four-line convention (unwrapped). Wrapped FASTQ files are generally considered a bad idea as the @, + delimiters can also appear in the quality string, thus parsing cannot be done safely.
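The Sanger convention maps each QV to the ASCII character with code QV + 33. A minimal standalone sketch of the round trip follows; the function names are hypothetical and are not pbcore APIs.

```python
SANGER_OFFSET = 33  # ASCII '!' encodes QV 0


def encode_quality(qvs):
    """Encode integer QVs in [0, 93] as a Sanger (Phred+33) string."""
    if any(q < 0 or q > 93 for q in qvs):
        raise ValueError("QVs must lie in [0, 93]")
    return "".join(chr(q + SANGER_OFFSET) for q in qvs)


def decode_quality(quality_string):
    """Decode a Sanger (Phred+33) string back into a list of integer QVs."""
    return [ord(c) - SANGER_OFFSET for c in quality_string]


print(encode_quality([0, 35, 93]))  # !D~
```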

property comment

The comment associated with the sequence in the FASTQ file, equal to the contents of the FASTQ header following the first whitespace.

classmethod fromString(s)

Interprets a string as a FASTQ record. Only supports four-line format, as wrapped FASTQs can’t easily be safely parsed.
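With the four-line convention, each field is identified by its position rather than by its leading character, which is what makes parsing safe even when the quality string starts with “@”. A standalone sketch, using the hypothetical name `parse_fastq_record` and returning a plain tuple:

```python
def parse_fastq_record(s):
    """Parse a four-line FASTQ record into (header, sequence, quality_string)."""
    lines = s.rstrip("\n").split("\n")
    if len(lines) != 4:
        raise ValueError("expected exactly four lines")
    header_line, sequence, plus_line, quality = lines
    if not header_line.startswith("@") or not plus_line.startswith("+"):
        raise ValueError("malformed '@' or '+' line")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header_line[1:], sequence, quality


# '@' encodes QV 31, so it may legitimately begin the quality line.
print(parse_fastq_record("@dog\nGATTACA\n+\n@DDDDDD\n"))
# ('dog', 'GATTACA', '@DDDDDD')
```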

property header

The header of the sequence in the FASTQ file

property id

The id of the sequence in the FASTQ file, equal to the FASTQ header up to the first whitespace.

property length

The length of the sequence

property name

DEPRECATED: The name of the sequence in the FASTQ file

property quality

The quality values, as an array of integers

property qualityString

The quality values as an ASCII-encoded string

reverseComplement(preserveHeader=False)

Return a new FastqRecord with the reverse-complemented DNA sequence. Set preserveHeader=True to keep the original header on the new record.

property sequence

The sequence for the record as present in the FASTQ file.

class pbcore.io.FastqReader(f)

Reader for FASTQ files, useable as a one-shot iterator over FastqRecord objects. FASTQ files must follow the four-line convention.

class pbcore.io.FastqWriter(f)

A FASTQ file writer class

Example:

>>> from pbcore.io import FastqWriter
>>> with FastqWriter("output.fq.gz") as writer:
...     writer.writeRecord("dog", "GATTACA", [35]*7)
...     writer.writeRecord("cat", "CATTACA", [35]*7)

(Notice that the underlying file will be automatically closed after exit from the with block.)

writeRecord(*args)

Write a FASTQ record to the file. If given one argument, it is interpreted as a FastqRecord. Given three arguments, they are interpreted as the name, sequence, and quality.

GFF Format (Version 3)

The GFF format is an open and flexible standard for representing genomic features.

class pbcore.io.Gff3Record(seqid, start, end, type, score='.', strand='.', phase='.', source='.', attributes=())

Class for GFF record, providing uniform access to standard GFF fields and attributes.

>>> from pbcore.io import Gff3Record
>>> record = Gff3Record("chr1", 10, 11, "insertion",
...                     attributes=[("foo", "1"), ("bar", "2")])
>>> record.start
10
>>> record.foo
'1'
>>> record.baz = 3
>>> del record.baz

Attribute access using record.fieldName notation raises ValueError if an attribute named fieldName doesn’t exist. Use:

>>> record.get(fieldName)

to fetch a field or attribute with None default or:

>>> record.get(fieldName, defaultValue)

to fetch the field or attribute with a custom default.

copy()

Return a shallow copy

classmethod fromString(s)

Parse a string as a GFF record. Trailing whitespace is ignored.
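For orientation, a GFF3 feature line has nine tab-separated columns, the last of which holds semicolon-separated key=value attributes. The sketch below parses such a line into a plain dict; `parse_gff3_line` is a hypothetical helper and does not return a Gff3Record.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict with a nested attributes dict."""
    fields = line.rstrip().split("\t")  # trailing whitespace is ignored
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has nine tab-separated columns")
    record = dict(zip(GFF3_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if fields[8] != ".":
        for pair in fields[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value
    record["attributes"] = attributes
    return record


rec = parse_gff3_line("chr1\t.\tinsertion\t10\t11\t.\t.\t.\tfoo=1;bar=2  \n")
print(rec["start"], rec["attributes"]["foo"])  # 10 1
```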

class pbcore.io.GffReader(f)

A GFF file reader class

class pbcore.io.GffWriter(f)

A GFF file writer class