Alphabet

class Alphabet(motifset, gap='-', moltype=None)

An ordered set of fixed-length strings, e.g. the 61 sense codons.

Ambiguities (e.g. N for any base in DNA) are not considered part of the alphabet itself, although a sequence is valid on the alphabet even if it contains ambiguities that are known to the alphabet. A gap is considered a separate motif and is not part of the alphabet itself.

The typical use is for the Alphabet to hold nucleic acid bases, amino acids, or codons.

The moltype, if supplied, handles ambiguities, coercion of the sequence to the correct data type, and complementation (if appropriate).

exception AlphabetError
add_note()

Exception.add_note(note) – add a note to the exception

args
with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

property Triples

Accessor for triples, lazy evaluation.

adapt_motif_probs(motif_probs)

Prepares an array or dictionary of probabilities for use with this alphabet by checking its size and order.
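As a rough illustration of what "checking size and order" could involve, the sketch below reorders a dictionary to match the alphabet and rejects an array of the wrong length. The helper name adapt_probs and the choice of ValueError are assumptions made for this example; it is not the cogent3 implementation.

```python
def adapt_probs(alphabet, motif_probs):
    """Return probabilities as a list in alphabet order (hypothetical helper)."""
    if isinstance(motif_probs, dict):
        # A dictionary carries its own labels, so it can be reordered.
        missing = [m for m in alphabet if m not in motif_probs]
        if missing:
            raise ValueError(f"no probability supplied for {missing}")
        return [motif_probs[m] for m in alphabet]
    # An array has no labels, so only its size can be checked.
    probs = list(motif_probs)
    if len(probs) != len(alphabet):
        raise ValueError("wrong number of probabilities for this alphabet")
    return probs


ordered = adapt_probs("TCAG", {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4})
```

Here ordered follows the alphabet order T, C, A, G rather than the order of the dictionary keys.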

count(value, /)

Return number of occurrences of value.

counts(a)

Returns array containing counts of each item in a.

For example, on the enumeration ‘UCAG’, the sequence ‘CCUG’ would return the array [1,2,0,1] reflecting one count for the first item in the enumeration (‘U’), two counts for the second item (‘C’), no counts for the third item (‘A’), and one count for the last item (‘G’).

The result will always be a vector of Int with length equal to the length of the enumeration. We return Int and not an unsigned type because it’s common to subtract counts, which produces surprising results on uint types (i.e. wraparound to maxint) unless the type is explicitly coerced by the user.

Silently ignores any unrecognized items, e.g. if your enumeration contains ‘TCAG’ and you get an ‘X’, the ‘X’ will be ignored because it has no index in the enumeration.
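The counting behaviour described above can be sketched in plain Python. This is an illustration only, not the cogent3 implementation: the helper name count_items is hypothetical, a list stands in for the integer array, and ctypes.c_uint32 is used merely to show what an unsigned type would do to a negative difference.

```python
import ctypes


def count_items(enumeration, seq):
    """Count occurrences of each enumeration item in seq (hypothetical helper)."""
    index = {item: i for i, item in enumerate(enumeration)}
    result = [0] * len(enumeration)
    for item in seq:
        i = index.get(item)
        if i is not None:  # items with no index are silently skipped
            result[i] += 1
    return result


counts_ccug = count_items("UCAG", "CCUG")    # [1, 2, 0, 1]
counts_with_x = count_items("TCAG", "TTXG")  # the 'X' is ignored

# Subtracting counts stays meaningful with signed integers ...
diff = counts_ccug[2] - counts_ccug[1]       # 0 - 2 == -2
# ... but the same value stored in an unsigned 32-bit type wraps around.
wrapped = ctypes.c_uint32(diff).value        # 4294967294
```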

from_indices(data)

Returns sequence of elements from sequence of indices.

Specifically, takes as input a sequence of numbers corresponding to elements in the Enumeration (i.e. the numbers must all be < len(self)). Returns a list of the items in the same order as the indices. Inverse of to_indices.

e.g. for the RNA alphabet (‘U’, ‘C’, ‘A’, ‘G’), the sequence [1,1,2,0] would produce the result ‘CCAU’, returning the element corresponding to each index in the input.
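The lookup in the example above amounts to indexing into the ordered items. A minimal sketch, using a plain tuple in ‘UCAG’ order and a hypothetical helper name rather than a real Alphabet instance:

```python
def from_indices_sketch(enumeration, indices):
    """Map each index back to its item, preserving order (hypothetical helper)."""
    return [enumeration[i] for i in indices]


ucag = ("U", "C", "A", "G")
items = from_indices_sketch(ucag, [1, 1, 2, 0])
joined = "".join(items)  # 'CCAU'
```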

from_ordinals_to_seq(data)

Returns a Sequence object corresponding to indices in data.

Parameters:

data (series) – series of int

Return type:

Sequence with self.moltype

Notes

Unlike from_indices(), this method uses the MolType to coerce the result into a sequence of the correct class.

Raises an AttributeError if MolType is not set.

from_seq_to_array(sequence)

Returns an array of indices corresponding to items in sequence.

Parameters:

sequence (Sequence) – A cogent3 sequence object

Return type:

ndarray

Notes

Unlike to_indices() in the superclass, this method returns a numpy array object. It also breaks the sequence into items in the current alphabet (e.g. breaking a raw DNA sequence into codons), which to_indices() does not do.
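The splitting step mentioned in the notes can be pictured as follows. This sketch is an assumption-laden stand-in: it builds all 64 trinucleotides in ‘TCAG’ order with itertools.product (not the 61 sense codons), returns a list rather than a numpy array, and uses the hypothetical helper name seq_to_codon_indices.

```python
from itertools import product


def seq_to_codon_indices(codons, raw_seq, motif_len=3):
    """Split raw_seq into fixed-length motifs and look up each index."""
    index = {codon: i for i, codon in enumerate(codons)}
    motifs = [raw_seq[i:i + motif_len] for i in range(0, len(raw_seq), motif_len)]
    # A trailing partial motif, or any unknown motif, raises KeyError here.
    return [index[m] for m in motifs]


# Hypothetical codon ordering: every trinucleotide over 'TCAG'.
codons = ["".join(p) for p in product("TCAG", repeat=3)]
indices = seq_to_codon_indices(codons, "TTTTTCGGG")  # TTT, TTC, GGG
```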

get_gap_motif()

Returns the motif that self is using as a gap. Note that this will typically be a multiple of self.gap.

get_matched_array(motifs, dtype=<class 'float'>)

Returns an array in which rows are motifs, columns are items in self.

Result is an array of Float in which a[i][j] indicates whether the ith motif passed in as motifs is a symbol that matches the jth character in self. For example, on the DNA alphabet ‘TCAG’, the degenerate symbol ‘Y’ would correspond to the row [1,1,0,0] because Y is a degenerate symbol that encompasses T and C but not A or G.

This code is similar to code in the Profile class, and should perhaps be merged with it (in particular, because there is nothing likelihood-specific about the resulting match table).
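The match table for the ‘Y’ example can be reproduced with a small sketch. The helper name matched_rows and the ambiguities mapping are hypothetical (the mapping uses the standard IUPAC meanings of Y and R), and nested lists stand in for the array.

```python
def matched_rows(alphabet, motifs, ambiguities):
    """Build one row per motif, one column per alphabet item."""
    rows = []
    for motif in motifs:
        # A symbol without an ambiguity entry matches only itself.
        matches = ambiguities.get(motif, {motif})
        rows.append([1.0 if char in matches else 0.0 for char in alphabet])
    return rows


ambiguities = {"Y": {"T", "C"}, "R": {"A", "G"}}
table = matched_rows("TCAG", ["Y", "A"], ambiguities)
```

The first row of table corresponds to ‘Y’ and the second to the unambiguous ‘A’.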

get_motif_len()

Returns the length of the items in self, or None if they differ.
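A sketch of how a shared motif length and the corresponding gap motif relate, assuming the gap motif is the gap character repeated once per position (the text above only says it is typically a multiple of the gap). Both helper names are hypothetical.

```python
def motif_len(items):
    """Return the common length of items, or None if lengths differ."""
    lengths = {len(item) for item in items}
    return lengths.pop() if len(lengths) == 1 else None


def gap_motif(gap, items):
    """Repeat the gap character to match the motif length (assumed behaviour)."""
    length = motif_len(items)
    return gap * length if length is not None else gap


codon_len = motif_len(["TTT", "TTC", "GGG"])
mixed_len = motif_len(["T", "TC"])
codon_gap = gap_motif("-", ["TTT", "TTC", "GGG"])
```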

get_subset(motif_subset, excluded=False)

Returns a new Alphabet object containing a subset of motifs in self.

Raises an exception if any of the items in the subset are not already in self. Always returns a new object.

get_word_alphabet(word_length)

Returns a new Alphabet object with items as word_length strings.

Note that the result is not a JointEnumeration object, and cannot unpack its indices. However, the items in the result _are_ all strings.
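Conceptually, a word alphabet contains every string of the requested length over the original items. The sketch below builds such a list with itertools.product; the helper name word_items is hypothetical, and the resulting order is that of itertools.product, which is an assumption rather than a documented property of get_word_alphabet.

```python
from itertools import product


def word_items(items, word_length):
    """Return all words of word_length over items, each as a plain string."""
    return ["".join(word) for word in product(items, repeat=word_length)]


dinucleotides = word_items("TCAG", 2)
```

Every entry of dinucleotides is an ordinary string such as ‘TT’ or ‘GA’, not a tuple of indices.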

includes_gap_motif()

Returns True if self includes the gap motif, False otherwise.

index(item)

Returns the index of a specified item.

This goes through an extra object lookup. If you _really_ need speed, you can bind self._obj_to_index.__getitem__ directly, but this is not recommended because the internal implementation may change.

is_valid(seq)

Returns True if seq contains only items in self.

property pairs

Accessor for pairs, lazy evaluation.

resolve_ambiguity(ambig_motif)

Returns set of symbols corresponding to ambig_motif.

Handles multi-character symbols and screens against the set of valid motifs, unlike the MolType version.
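The two behaviours named above, expanding a multi-character symbol and screening against the valid motifs, can be sketched as follows. The helper name resolve, the ambiguities mapping and the deliberately incomplete set of valid motifs are all hypothetical.

```python
from itertools import product


def resolve(ambig_motif, ambiguities, valid_motifs):
    """Expand each position of ambig_motif, then keep only valid motifs."""
    per_position = [ambiguities.get(char, {char}) for char in ambig_motif]
    expanded = {"".join(combo) for combo in product(*per_position)}
    return expanded & set(valid_motifs)


ambiguities = {"Y": {"T", "C"}}
valid = {"TT", "TC", "CT"}  # hypothetical alphabet that lacks 'CC'
resolved = resolve("YY", ambiguities, valid)
```

‘YY’ expands to four dinucleotides, but ‘CC’ is screened out because it is not among the valid motifs.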

to_indices(data)

Returns sequence of indices from sequence of elements.

Raises KeyError if some of the elements were not found.

Expects data to be a sequence (e.g. list or tuple) of items that are in the Enumeration. Returns a list containing the index of each element in the input, in order.

e.g. for the RNA alphabet (‘U’, ‘C’, ‘A’, ‘G’), the sequence ‘CCAU’ would produce the result [1,1,2,0], returning the index of each element in the input.
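A minimal sketch of the forward lookup, including the KeyError for an unknown element. It uses a plain dictionary and a hypothetical helper name, not the Alphabet class itself.

```python
def to_indices_sketch(enumeration, data):
    """Return the index of each element of data, in order."""
    index = {item: i for i, item in enumerate(enumeration)}
    return [index[item] for item in data]  # unknown items raise KeyError


rna = ("U", "C", "A", "G")
indices = to_indices_sketch(rna, "CCAU")

try:
    to_indices_sketch(rna, "CCXU")
    unknown_raises = False
except KeyError:
    unknown_raises = True
```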

to_json()

Returns the alphabet as a JSON-formatted string.

to_rich_dict(for_pickle=False)
with_gap_motif()

Returns an Alphabet object resembling self but including the gap.

Always returns the same object.