Other parsers

VCF_Reader and VariantCall

VCF is a text file format, usually stored in compressed form. It contains meta-information lines, a header line, and then data lines, each of which describes one position in the genome.

Genotype information on samples may optionally be included for each position.
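The three kinds of lines can be told apart by their prefix alone. The following standalone sketch (plain Python, not HTSeq's own code; the sample lines and the function name classify_vcf_lines are made up for illustration) sorts a small VCF snippet into meta-information lines, the header line, and data lines, and reads the sample names from the header.

```python
# Illustrative only: a tiny, hand-written VCF snippet (not real data).
vcf_lines = [
    "##fileformat=VCFv4.0",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA",
    "1\t10327\t.\tT\tC\t.\tPASS\tDP=14\tGT\t0/1",
]

def classify_vcf_lines(lines):
    """Split VCF lines into (meta lines, header line, data lines)."""
    meta, header, data = [], None, []
    for line in lines:
        if line.startswith("##"):
            meta.append(line)      # meta-information line
        elif line.startswith("#"):
            header = line          # the single column header line
        else:
            data.append(line)      # one variant position per line
    return meta, header, data

meta, header, data = classify_vcf_lines(vcf_lines)
# Columns after the ninth one (FORMAT) hold per-sample genotype information.
sample_ids = header.lstrip("#").split("\t")[9:]
print(len(meta), len(data), sample_ids)
```

If the header has no columns beyond the eighth (INFO), the file carries no genotype information and sample_ids is empty.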

See the definitions at

As usual, there is a parser class, called VCF_Reader, that can generate an iterator of objects describing the structural variant calls. These objects are of type VariantCall and each describes one line of a VCF file. See below for an example.

class HTSeq.VCF_Reader(filename_or_sequence)

As a subclass of FileOrSequence, VCF_Reader can be initialized either with a file name or with an open file or another sequence of lines.

When requesting an iterator, it generates objects of type VariantCall.

metadata

VCF_Reader skips all lines starting with a single ‘#’, as this marks a comment. However, lines starting with ‘##’ contain metadata (information about filters and about the fields in the ‘info’ column).

parse_meta(header_filename = None)

Normally, VCF_Reader does not parse the meta-information, and a VariantCall does not contain unpacked meta-information either. The function parse_meta reads the header information either from the attached FileOrSequence or from a file opened from the provided header_filename. This is important if you want to access sample-specific information for the VariantCall objects in your .vcf file.

make_info_dict()

This function parses the info string and creates the attribute infodict, a dict of key:value pairs containing the type information for each entry of the VariantCall’s info field.
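To make the idea of a key-to-type mapping concrete, here is a minimal standalone sketch that collects the declared type of each INFO entry from ‘##INFO’ meta lines. It is not HTSeq's implementation: the function name build_info_types, the regular expression, and the choice to store type names as plain strings are assumptions made for illustration, and the meta lines are hand-written.

```python
import re

# Hand-written meta lines for illustration (not taken from a real file).
meta_lines = [
    "##fileformat=VCFv4.0",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth">',
    '##INFO=<ID=AF,Number=1,Type=Float,Description="Allele frequency">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
]

# Captures the ID and Type entries of an INFO meta line.
INFO_PATTERN = re.compile(r"^##INFO=<ID=([^,]+),.*?Type=([^,>]+)")

def build_info_types(lines):
    """Map each INFO ID to the type name declared in the meta lines."""
    types = {}
    for line in lines:
        match = INFO_PATTERN.match(line)
        if match:  # non-INFO meta lines are simply ignored
            types[match.group(1)] = match.group(2)
    return types

info_types = build_info_types(meta_lines)
print(info_types)
```

A mapping of this kind is what allows the values in the info column to be converted from text to numbers or flags later on.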

class HTSeq.VariantCall(line, nsamples = 0, sampleids=[])

A VariantCall object always contains the following attributes:

alt

The alternative base(s) of the VariantCall. This is a list containing all called alternatives.

chrom

The Chromosome on which the VariantCall was called.

filter

Specifies whether the VariantCall passed all the filters given in the .vcf header (value PASS); otherwise it contains a list of the filters that failed (the filter IDs are also specified in the header).

format

Contains the format string specifying which per-sample information is stored in VariantCall.samples.

id

The id of the VariantCall if it has been found in any database; for unknown variants this will be “.”.

info

This will contain either the string version of the info field for this VariantCall or a dict with the parsed and processed info-string.

pos

A HTSeq.GenomicPosition that specifies the position of the VariantCall.

qual

The quality of the VariantCall.

ref

The reference base(s) of the VariantCall.

samples

A dict mapping sample IDs to subdicts which use the entries of VariantCall.format as keys to store the per-sample information.
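The relationship between format and samples can be illustrated with a short standalone sketch. It is not HTSeq's code: build_samples is a hypothetical helper, the sample names and values are invented, and all values are left as strings.

```python
def build_samples(format_string, sample_ids, sample_columns):
    """Pair each sample's colon-separated values with the FORMAT keys."""
    keys = format_string.split(":")
    return {
        sample_id: dict(zip(keys, column.split(":")))
        for sample_id, column in zip(sample_ids, sample_columns)
    }

# FORMAT column "GT:DP" followed by one column per sample.
samples = build_samples("GT:DP", ["sampleA", "sampleB"], ["0/1:14", "1/1:9"])
print(samples["sampleA"]["GT"])
```

Each sample column of a VCF data line lists its values in the same order as the keys in the FORMAT column, which is why a simple zip suffices here.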

unpack_info(infodict)

This function parses the info string and replaces it with a dict representation, provided that the infodict of the originating VCF_Reader is given.
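What unpacking an info string with the help of a type mapping involves can be sketched as follows. This is a standalone illustration under stated assumptions, not HTSeq's implementation: the function name unpack_info_string, the CONVERTERS table, and the example mapping are hypothetical, and the exact structure HTSeq produces may differ.

```python
# Assumed mapping from INFO key to declared type, as a meta-line parser
# might produce it (hypothetical values for illustration).
info_types = {"DP": "Integer", "AF": "Float", "DB": "Flag"}

# How each declared type name is converted from text.
CONVERTERS = {"Integer": int, "Float": float, "String": str}

def unpack_info_string(info, types):
    """Turn a string such as 'DP=14;AF=0.5;DB' into a dict of typed values."""
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if not sep:
            # Flags carry no value; their presence means True.
            result[key] = True
            continue
        # Unknown keys or types fall back to plain strings.
        convert = CONVERTERS.get(types.get(key), str)
        # A field may hold several comma-separated values.
        values = [convert(v) for v in value.split(",")]
        result[key] = values[0] if len(values) == 1 else values
    return result

unpacked = unpack_info_string("DP=14;AF=0.25,0.5;DB", info_types)
print(unpacked)
```

Without the type mapping, every value would have to stay a string, which is why the mapping built from the header has to be available before the info field can be unpacked.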

Example workflow for reading dbSNP in VCF format (obtained from dbSNP, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz):

>>> import HTSeq
>>> vcfr = HTSeq.VCF_Reader( "00-All.vcf.gz" )
>>> vcfr.parse_meta()
>>> vcfr.make_info_dict()
>>> for vc in vcfr:
...    print(vc)
1:10327:'T'->'C'
1:10433:'A'->'AC'
1:10439:'AC'->'A'
1:10440:'C'->'A'

Note: the example above has not been run, as the example file is still missing.

Wiggle Reader

The Wiggle format (file extension often .wig) is a format to describe numeric scores assigned to base-pair positions on a genome. The class WiggleReader is a parser for such files.

class HTSeq.WiggleReader(filename_or_sequence, verbose=True)

The class is instantiated with the file name of a Wiggle file, or a sequence of lines in Wiggle format. A WiggleReader object generates an iterator, which yields pairs of the form (iv, score), where iv is a GenomicInterval object and score is a float with the score that the file assigns to the specified interval. If verbose is set to True, the user is alerted to skipped lines (comments or browser lines) by a message printed to the standard output.
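The idea of turning Wiggle lines into (interval, score) pairs can be sketched in plain Python. The sketch below is not HTSeq's implementation: iter_fixed_step is a hypothetical helper that handles only fixedStep blocks, represents each interval as a (chrom, start, end) tuple instead of a GenomicInterval, and keeps positions exactly as written in the file, without addressing any coordinate conversion HTSeq may apply.

```python
def iter_fixed_step(lines):
    """Yield ((chrom, start, end), score) for fixedStep Wiggle lines."""
    chrom = pos = step = None
    span = 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "browser", "track")):
            continue  # comments, browser lines, and track lines carry no scores
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            decl = dict(field.split("=", 1) for field in line.split()[1:])
            chrom = decl["chrom"]
            pos = int(decl["start"])
            step = int(decl["step"])
            span = int(decl.get("span", 1))  # span defaults to 1 if omitted
            continue
        # Any other line is a score for the current position.
        yield (chrom, pos, pos + span), float(line)
        pos += step

# Hand-written example lines (not real data).
wiggle_lines = [
    "track type=wiggle_0",
    "fixedStep chrom=chr1 start=100 step=10 span=5",
    "1.5",
    "2.0",
]
pairs = list(iter_fixed_step(wiggle_lines))
print(pairs)
```

Each score line advances the position by step, so a block of scores expands into one interval per line.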

BED Reader

The BED format is a format originally used to describe gene models but is also commonly used to describe other genomic features.

class HTSeq.BED_Reader(filename_or_sequence)

The class is instantiated with the file name of a BED file, or a sequence of lines in BED format. A BED_Reader object generates an iterator, which yields a GenomicFeature object for each line in the BED file (except for lines starting with track, which are skipped).

The attributes of the yielded GenomicFeature objects are as follows:

iv
a GenomicInterval object with the coordinates as given by the 1st, 2nd, 3rd, and 6th columns of the BED file. If the BED file has fewer than 6 columns, the strand is set to “.”.
name
the name of the feature as given in the 4th column, or “unnamed” if the file has only three columns
type
always the string “BED line”
score
a float with the score as given by the 5th column (or None if the BED file has fewer than 5 columns).
thick
a GenomicInterval object containing the “thick” part of the feature, as specified by the 7th and 8th columns, with chromosome and strand copied from iv (or None if the BED file has fewer than 8 columns).
itemRgb
a list of three int values, taken from the 9th column (None if the BED file has fewer than 9 columns). In a BED file, this triple is meant to specify the colour in which the feature should be drawn in a browser.
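The mapping from BED columns to these attributes can be summarised in a short standalone sketch. It is not HTSeq's code: parse_bed_line is a hypothetical helper that returns a plain dict with tuples in place of GenomicFeature and GenomicInterval objects, and it follows the standard BED column order (strand in column 6, thickStart and thickEnd in columns 7 and 8, itemRgb in column 9). The example lines are invented.

```python
def parse_bed_line(line):
    """Return a dict of the attributes listed above for one BED line."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least three columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    strand = fields[5] if len(fields) > 5 else "."
    return {
        "iv": (chrom, start, end, strand),
        "name": fields[3] if len(fields) > 3 else "unnamed",
        "type": "BED line",
        "score": float(fields[4]) if len(fields) > 4 else None,
        # thickStart and thickEnd are columns 7 and 8 of standard BED.
        "thick": (chrom, int(fields[6]), int(fields[7]), strand)
                 if len(fields) > 7 else None,
        # itemRgb is column 9, written as "R,G,B".
        "itemRgb": [int(v) for v in fields[8].split(",")]
                   if len(fields) > 8 else None,
    }

full = parse_bed_line("chr1 1000 5000 geneA 960 + 1200 4800 255,0,0")
short = parse_bed_line("chr2 10 20")
print(full["thick"], short["name"])
```

A three-column line thus still yields a usable record, with the optional attributes filled by their defaults.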