Features

The easiest way to work with annotation is to use GenomicArray with typecode=='O' or GenomicArrayOfSets. If you have your annotation in a flat file, with each line describing a feature and giving its coordinates, you can read in the file line for line, parse it (see the standard Python module csv), use the information on chromosome, start, end and strand to create a GenomicInterval object and then store the data from the line in the genomic array at the place indicated by the genomic interval.

For example, if you have data in a tab-separated file as follows:

>>> for line in open( "feature_list.txt" ):  
...     print line,
chr2  100    300     +       "gene A"
chr2 200     400     -       "gene B"
chr3 150     270     +       "gene C"

Then, you could load this information as follows:

>>> import csv
>>> genes = HTSeq.GenomicArray( [ "chr1", "chr2", "chr3" ], typecode='O' )
>>> for (chrom, start, end, strand, name) in \
...        csv.reader( open("feature_list.txt"), delimiter="\t" ):
...     iv = HTSeq.GenomicInterval( chrom, int(start), int(end), strand )
...     genes[ iv ] = name

Now, to see whether there is a feature at a given GenomicPosition, you just query the genomic array:

>>> print genes[ HTSeq.GenomicPosition( "chr3", 100, "+" ) ]
None
>>> print genes[ HTSeq.GenomicPosition( "chr3", 200, "+" ) ]
gene C

See GenomicArray and GenomicArrayOfSets for more sophisticated use.

GFF_Reader and GenomicFeature

One of the most common format for annotation data is GFF (which includes GTF as a sub-type). Hence, a parse for GFF files is included in HTSeq.

As usual, there is a parser class, called GFF_Reader, that can generate an iterator of objects describing the features. These objects are of type :class`GenomicFeature` and each describes one line of a GFF file. See Section A tour through HTSeq for an example.

class HTSeq.GFF_Reader(filename_or_sequence, end_included=True)

As a subclass of FileOrSequence, GFF_Reader can be initialized either with a file name or with an open file or another sequence of lines.

When requesting an iterator, it generates objects of type GenomicFeature.

The GFF specification is unclear on whether the end coordinate marks the last base-pair of the feature (closed intervals, end_included=True) or the one after (half-open intervals, end_included=False). The default, True, is correct for Ensembl GTF files. If in doubt, look for a CDS or stop_codon feature in you GFF file. Its length should be divisible by 3. If “end-start” is divisible by 3, you need end_included=False. If “end-start+1” is divisible by 3, you need end_included=True.

GFF_Reader will convert the coordinates from GFF standard (1-based, end maybe included) to HTSeq standard (0-base, end not included) by subtracting 1 from the start position, and, for end_included=True, also subtract 1 from the end position.

metadata

GFF_Reader skips all lines starting with a single ‘#’ as this marks a comment. However, lines starying with ‘##’ contain meta data (at least accoring to the Sanger Institute’s version of the GFF standard.) Such meta data has the format ##key value. When a metadata line is encountered, it is added to the metadata dictionary.

class HTSeq.GenomicFeature(name, type_, interval)

A GenomicFeature object always contains the following attributes:

name

A name of ID for the feature. As the GFF format does not have a dedicated field for this, the value of the first attribute in the attributes column is assumed to be the name of ID.

type

The type of the feature, i.e., a string like "exon" or "gene". For GFF files, the 3rd column (feature) is taken as the type.

interval

The interval that the feature covers on the genome. For GFF files, this information is taken from the first (seqname), the forth (start), the fifth (end), and the seventh (strand) column.

When created by a GFF_Reader object, the following attributes are also present, with the information from the remaining GFF columns:

source

The 2nd column, denoted source in the specification, and intended to specify the data source.

frame

The 8th column (frame), giving the reading frame in case of a coding feature. Its value is an integer (0, 1, or 2), or the string '.' in case that a frame is not specified or would not make sense.

score

The 6th column (score), giving some numerical score for the feature. Its value is a float, or '.' in case that a score is not specified or would not make sense

attr

The last (9th) column of a GFF file contains attributes, i.e. a list of name/value pairs. These are transformed into a dict, such that, e.g., gf.attr['gene_id'] gives the value of the attribute gene_id in the feature described by GenomicFeature object gf. The parser for the attribute field is reasonably flexible to deal with format variations (it was never clearly established whetehr name and value should be sperarated by a colon or an equal sign, and whether quotes need to be used) and also does a URL style decoding, as is often required.

In order to write a GFF file from a sequence of features, this method is provided:

get_gff_line(with_equal_sign=False)

Returns a line to describe the feature in the GFF format. This works even if the optional attributes given above are missing. Call this for each of your GenomicFeature objects and write the lines into a file to get a GFF file.

HTSeq.parse_GFF_attribute_string(attrStr, extra_return_first_value=False)

This is the function that GFF_Reader uses to parse the attribute column. (See GenomicFeature.attr.) It returns a dict, or, if requested, a pair of the dict and the first value.