Other parsers
VCF_Reader and VariantCall
VCF is a text file format (usually stored in compressed form). It contains meta-information lines, a header line, and then data lines, each containing information about a position in the genome.
Genotype information on samples for each position is optional.
See the definitions at
As usual, there is a parser class, called VCF_Reader, that can generate an iterator of objects describing the variant calls. These objects are of type VariantCall, and each describes one line of a VCF file. See below for an example.
-
class HTSeq.VCF_Reader(filename_or_sequence)
As a subclass of FileOrSequence, VCF_Reader can be initialized either with a file name or with an open file or another sequence of lines. When requesting an iterator, it generates objects of type VariantCall.
-
metadata
VCF_Reader skips all lines starting with a single ‘#’, as this marks a comment. However, lines starting with ‘##’ contain metadata (information about filters and about the fields in the ‘info’ column).
-
parse_meta(header_filename=None)
The VCF_Reader normally does not parse the meta-information, and the VariantCall does not contain unpacked meta-information either. The function parse_meta reads the header information either from the attached FileOrSequence or from a file opened under the provided header_filename. This is important if you want to access sample-specific information for the VariantCall objects in your .vcf file.
-
make_info_dict()
This function parses the info string and creates the attribute infodict, a dict of key:value pairs containing the type information for each entry of the VariantCall's info field.
-
-
class HTSeq.VariantCall(line, nsamples=0, sampleids=[])
A VariantCall object always contains the following attributes:
-
alt
The alternative base(s) of the VariantCall. This is a list containing all called alternatives.
-
chrom
The chromosome on which the VariantCall was called.
-
filter
This specifies whether the VariantCall passed all the filters given in the .vcf header (value PASS); otherwise it contains a list of the filters that failed (the filter IDs are also specified in the header).
-
format
Contains the format string specifying which per-sample information is stored in VariantCall.samples.
-
id
The id of the VariantCall, if it has been found in any database; for unknown variants this will be ".".
-
info
This will contain either the string version of the info field for this VariantCall or a dict with the parsed and processed info string.
-
pos
A HTSeq.GenomicPosition that specifies the position of the VariantCall.
-
qual
The quality of the VariantCall.
-
ref
The reference base(s) of the VariantCall.
-
samples
A dict mapping sample IDs to subdicts which use the entries of VariantCall.format as keys to store the per-sample information.
-
unpack_info(infodict)
This function parses the info string and replaces it with a dict representation if the infodict of the originating VCF_Reader is provided.
-
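To make the attribute list above more concrete, the following is a minimal pure-Python sketch of how one VCF data line maps onto those fields. It is not HTSeq's implementation: the function name parse_vcf_line, the sample IDs, and the example line are made up for illustration, and the position is kept as a plain integer rather than a HTSeq.GenomicPosition.

```python
def parse_vcf_line(line, sample_ids=()):
    """Split one tab-separated VCF data line into a plain dict (illustration only)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),        # VCF positions are 1-based
        "id": vid,              # "." for unknown variants
        "ref": ref,
        "alt": alt.split(","),  # list of all called alternatives
        "qual": qual,           # kept as the raw string in this sketch
        "filter": filt,         # "PASS" or the IDs of the failed filters
        "info": info,           # raw info string, not unpacked here
        "format": None,
        "samples": {},
    }
    if len(fields) > 8:
        # The FORMAT column names the keys used in every per-sample column.
        record["format"] = fields[8]
        keys = fields[8].split(":")
        for sample_id, column in zip(sample_ids, fields[9:]):
            record["samples"][sample_id] = dict(zip(keys, column.split(":")))
    return record


# Hypothetical data line with two made-up samples.
line = "1\t10327\t.\tT\tC\t.\tPASS\tDP=14\tGT:DP\t0/1:8\t1/1:6"
call = parse_vcf_line(line, sample_ids=["sampleA", "sampleB"])
print(call["alt"], call["samples"]["sampleB"]["GT"])
```

The nested dict under "samples" mirrors the description of VariantCall.samples: one subdict per sample, keyed by the entries of the format string.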
Example workflow for reading dbSNP in VCF format (obtained from dbSNP, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz):
>>> import HTSeq
>>> vcfr = HTSeq.VCF_Reader( "00-All.vcf.gz" )
>>> vcfr.parse_meta()
>>> vcfr.make_info_dict()
>>> for vc in vcfr:
...     print(vc)
1:10327:'T'->'C'
1:10433:'A'->'AC'
1:10439:'AC'->'A'
1:10440:'C'->'A'
FIXME The example above is not run, as the example file is still missing!
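The idea behind make_info_dict and unpack_info can be sketched without HTSeq. The snippet below is an assumption-laden illustration, not the library's code: the helper names make_info_types and unpack_info_string and the sample header lines are invented here. It collects the declared Type of each INFO ID from the '##INFO' meta lines and then uses those types to turn an info string into a dict.

```python
import re

# Converters for the numeric VCF INFO types; everything else stays a string.
CASTS = {"Integer": int, "Float": float}


def make_info_types(meta_lines):
    """Map each ID declared in a '##INFO=<...>' line to its declared Type."""
    types = {}
    for line in meta_lines:
        if not line.startswith("##INFO=<"):
            continue
        m_id = re.search(r"ID=([^,>]+)", line)
        m_type = re.search(r"Type=([^,>]+)", line)
        if m_id and m_type:
            types[m_id.group(1)] = m_type.group(1)
    return types


def convert(value, type_name):
    """Convert one value; '.' marks a missing value and becomes None."""
    if value == ".":
        return None
    return CASTS.get(type_name, str)(value)


def unpack_info_string(info, types):
    """Turn 'KEY=VALUE;FLAG;...' into a dict, using the declared types."""
    out = {}
    for entry in info.split(";"):
        if entry in ("", "."):
            continue
        key, sep, value = entry.partition("=")
        if not sep:
            out[key] = True  # flag entries carry no value
            continue
        values = [convert(v, types.get(key)) for v in value.split(",")]
        out[key] = values[0] if len(values) == 1 else values
    return out


# Made-up header lines in the style of a VCF file.
meta = [
    "##fileformat=VCFv4.0",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">',
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership">',
]
types = make_info_types(meta)
print(unpack_info_string("DP=14;AF=0.5,0.25;DB", types))
```

Keeping the type table separate from the per-line unpacking means the header is parsed once and reused for every data line, which matches the division of labour between the reader and the individual calls described above.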
Wiggle Reader
The Wiggle format (file extension often .wig) is a format to describe numeric scores assigned to base-pair positions on a genome.
The class WiggleReader is a parser for such files.
-
class HTSeq.WiggleReader(filename_or_sequence, verbose=True)
The class is instantiated with the file name of a Wiggle file, or a sequence of lines in Wiggle format. A WiggleReader object generates an iterator, which yields pairs of the form (iv, score), where iv is a GenomicInterval object and score is a float with the score that the file assigns to the specified interval. If verbose is set to True, the user is alerted to skipped lines (comments or browser lines) by a message printed to the standard output.
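As a rough illustration of what a Wiggle parser has to do, here is a self-contained sketch that handles the variableStep and fixedStep declarations of the Wiggle format and yields (chrom, start, end, score) tuples. It is not how WiggleReader is implemented: the function name iter_wiggle is hypothetical, plain tuples with 0-based, half-open coordinates stand in for GenomicInterval objects, and track lines are skipped silently as an assumption of this sketch.

```python
def iter_wiggle(lines):
    """Yield (chrom, start, end, score) with 0-based, half-open coordinates."""
    mode, params = None, {}
    span, step, position = 1, 1, None
    for raw in lines:
        line = raw.strip()
        # Skip blank lines, comments, and browser/track lines.
        if not line or line.startswith(("#", "browser", "track")):
            continue
        fields = line.split()
        if fields[0] in ("fixedStep", "variableStep"):
            # Declaration line: remember its key=value parameters.
            mode = fields[0]
            params = dict(f.split("=", 1) for f in fields[1:])
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
            continue
        if mode == "variableStep":
            # Data line: "<position> <score>", position is 1-based.
            start = int(fields[0]) - 1
            yield params["chrom"], start, start + span, float(fields[1])
        elif mode == "fixedStep":
            # Data line: "<score>", position advances by step each line.
            start = position - 1
            yield params["chrom"], start, start + span, float(fields[0])
            position += step
        else:
            raise ValueError("data line before any step declaration: " + line)


# Made-up example input.
wig_lines = [
    "track type=wiggle_0",
    "variableStep chrom=chr1 span=2",
    "101 1.5",
    "105 2.0",
    "fixedStep chrom=chr2 start=11 step=5",
    "0.25",
    "0.75",
]
for record in iter_wiggle(wig_lines):
    print(record)
```

Writing the parser as a generator follows the same pattern as the reader classes in this section: the caller iterates over records one at a time without loading the whole file.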
BED Reader
The BED format is a format originally used to describe gene models but is also commonly used to describe other genomic features.
-
class HTSeq.BED_Reader(filename_or_sequence)
The class is instantiated with the file name of a BED file, or a sequence of lines in BED format. A BED_Reader object generates an iterator, which yields a GenomicFeature object for each line in the BED file (except for lines starting with track, which are skipped).
The attributes of the yielded GenomicFeature objects are as follows:
iv
- a GenomicInterval object with the coordinates as given by the 1st, 2nd, 3rd, and 6th column of the BED file. If the BED file has fewer than 6 columns, the strand is set to ".".
name
- the name of the feature as given in the 4th column, or unnamed, if the file has only three columns
type
- always the string BED line
score
- a float with the score as given by the 5th column (or None if the BED file has fewer than 5 columns)
thick
- a GenomicInterval object containing the "thick" part of the feature, as specified by the 7th and 8th column, with chromosome and strand copied from iv (or None if the BED file has fewer than 8 columns)
itemRgb
- a list of three int values, taken from the 9th column (None if the BED file has fewer than 9 columns). In a BED file, this triple is meant to specify the colour in which the feature should be drawn in a browser.
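The mapping from BED columns to feature attributes can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the BED_Reader code: the function name parse_bed_line and the example lines are made up, tuples stand in for GenomicInterval objects, and the column positions follow the standard BED layout (strand in column 6, thickStart and thickEnd in columns 7 and 8, itemRgb in column 9).

```python
def parse_bed_line(line):
    """Turn one BED line into a dict of the attributes listed above."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Strand comes from column 6; "." if the line is too short.
    strand = fields[5] if len(fields) > 5 else "."
    return {
        "iv": (chrom, start, end, strand),
        "name": fields[3] if len(fields) > 3 else "unnamed",
        "type": "BED line",
        "score": float(fields[4]) if len(fields) > 4 else None,
        # Thick part reuses chromosome and strand of the main interval.
        "thick": (chrom, int(fields[6]), int(fields[7]), strand)
        if len(fields) > 7 else None,
        "itemRgb": [int(c) for c in fields[8].split(",")]
        if len(fields) > 8 else None,
    }


# Made-up example lines: a full 9-column record and a minimal 3-column one.
full = parse_bed_line("chr1\t1000\t5000\tgeneA\t960\t+\t1200\t4800\t255,0,0")
short = parse_bed_line("chr2\t10\t20")
print(full["thick"], full["itemRgb"])
print(short["name"], short["score"], short["iv"])
```

Each optional attribute falls back to a default as soon as its column is missing, so the same function handles BED files of any width from 3 to 9 columns.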