MolType

class MolType(motifset, gap='-', missing='?', gaps=None, seq_constructor=None, ambiguities=None, label=None, complements=None, pairs=None, mw_calculator=None, add_lower=False, preserve_existing_moltypes=False, make_alphabet_group=False, array_seq_constructor=None, colors=None)

MolType: Handles operations that depend on the sequence type (e.g. DNA).

The MolType knows how to connect alphabets, sequences, alignments, and so forth, and how to disambiguate ambiguous symbols and perform base pairing (where appropriate).

WARNING: Objects passed to a MolType become associated with that MolType, i.e. if you pass ProteinSequence to a new MolType you make up, all ProteinSequences will now be associated with the new MolType. This may not be what you expect. Use preserve_existing_moltypes=True if you don’t want to reset the moltype.

can_match(first, second)

Returns True if every position in first could match the same position in second.

Truncates at length of shorter sequence. Gaps are only allowed to match other gaps.
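The matching rule can be illustrated with a standalone sketch that does not call the library. It assumes two symbols "could match" when the sets of monomers they stand for overlap. The names DEGENERATES, expand and could_match are hypothetical, and the table covers only a small subset of the IUPAC DNA codes.

```python
# Hypothetical subset of IUPAC DNA ambiguity codes (an assumption, not the
# library's own table).
DEGENERATES = {"R": {"A", "G"}, "Y": {"C", "T"}, "N": {"A", "C", "G", "T"}}


def expand(symbol):
    """Return the set of symbols that `symbol` can stand for."""
    # A monomer or a gap stands only for itself.
    return DEGENERATES.get(symbol, {symbol})


def could_match(first, second):
    """True if every aligned pair of symbols shares at least one possibility."""
    # zip() stops at the shorter input, mirroring the truncation rule.
    return all(expand(a) & expand(b) for a, b in zip(first, second))


print(could_match("AR-", "NG-"))  # True: A is in N, G is in R, gap matches gap
print(could_match("A-", "AC"))    # False: a gap cannot match a base
```

Because a gap expands only to itself in this sketch, it can match nothing but another gap.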

can_mismatch(first, second)

Returns True if any position in first could cause a mismatch with second.

Truncates at length of shorter sequence. Gaps are always counted as matches.

can_mispair(first, second)

Returns True if any position in first could mispair with second.

Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.

Truncates at length of shorter sequence. Gaps are always counted as possible mispairs, as are weak pairs like GU.

can_pair(first, second)

Returns True if first and second could pair.

Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.

Truncates at length of shorter sequence. Gaps are only allowed to pair with other gaps, and are counted as ‘weak’ (same category as GU and degenerate pairs).

NOTE: second must support being reversed.
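The reverse-order pairing rule can be sketched without the library. The names PAIRS and could_pair are hypothetical, and the pair table assumes RNA Watson-Crick pairs plus the GU wobble pair, with no gaps or degenerate symbols.

```python
# Hypothetical pair table: Watson-Crick pairs plus the GU wobble pair.
PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def could_pair(first, second):
    """True if each position of `first` pairs with `second` read backwards."""
    # first[0] is checked against second[-1], first[1] against second[-2], etc.
    # zip() stops at the shorter input, mirroring the truncation rule.
    return all((a, b) in PAIRS for a, b in zip(first, reversed(second)))


print(could_pair("GGAC", "GUCC"))  # True: G-C, G-C, A-U, C-G
print(could_pair("AAAA", "AAAA"))  # False: A does not pair with A
```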

complement(item)

Returns complement of item, using data from self.complements.

Always tries to return same type as item: if item looks like a dict, will return list of keys.

count_degenerate(sequence)

Counts the degenerate bases in the specified sequence.

count_gaps(sequence)

Counts the gaps in the specified sequence.

degap(sequence)

Deletes all gap characters from sequence.

degenerate_from_seq(sequence)

Returns least degenerate symbol corresponding to chars in sequence.

First tries to look up in self.inverse_degenerates. Then disambiguates and tries to look up in self.inverse_degenerates. Then tries converting the case (tries uppercase before lowercase). Raises TypeError if conversion fails.

disambiguate(sequence, method='strip')

Returns a non-degenerate sequence from a degenerate one.

method can be ‘strip’ (deletes any characters not in monomers or gaps) or ‘random’ (assigns the possibilities at random, using equal frequencies).
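The two methods can be contrasted with a small standalone sketch. It is an illustration under assumptions, not the library's implementation: DEGENERATES, MONOMERS, GAPS and the rng parameter are hypothetical, and only a few IUPAC DNA codes are included.

```python
import random

# Hypothetical tables for DNA (assumptions made for this sketch).
DEGENERATES = {"R": "AG", "Y": "CT", "N": "ACGT"}
MONOMERS = set("ACGT")
GAPS = set("-")


def disambiguate(sequence, method="strip", rng=random):
    """Return a non-degenerate version of `sequence`."""
    if method == "strip":
        # Keep only monomers and gaps; everything else is deleted.
        return "".join(c for c in sequence if c in MONOMERS or c in GAPS)
    if method == "random":
        # Replace each degenerate symbol by one of its possibilities,
        # chosen with equal probability; other symbols are kept as they are.
        return "".join(
            rng.choice(DEGENERATES[c]) if c in DEGENERATES else c
            for c in sequence
        )
    raise ValueError(f"unknown method: {method!r}")


print(disambiguate("AR-NT"))  # A-T
print(disambiguate("AR-NT", method="random"))  # varies, e.g. AG-CT
```

Note that ‘strip’ can shorten the sequence, whereas ‘random’ preserves its length.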

first_degenerate(sequence)

Returns the index of first degenerate symbol in sequence, or None.

first_gap(sequence)

Returns the index of the first gap in the sequence, or None.

first_invalid(sequence)

Returns the index of first invalid symbol in sequence, or None.

first_non_strict(sequence)

Returns the index of first non-strict symbol in sequence, or None.

first_not_in_alphabet(sequence, alphabet=None)

Returns index of first item not in alphabet, or None.

Defaults to self.alphabet if alphabet not supplied.

gap_indices(sequence)

Returns list of indices of all gaps in the sequence, or [].

gap_maps(sequence)

Returns tuple containing dicts mapping between gapped and ungapped.

First element is a dict such that d[ungapped_coord] = gapped_coord. Second element is a dict such that d[gapped_coord] = ungapped_coord.

Note that the dicts will be invalid if the sequence changes after the dicts are made.

The gaps themselves are not in the dictionary, so use d.get() or test ‘if pos in d’ to avoid KeyErrors if looking up all elements in a gapped sequence.
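A standalone sketch of the two mappings may make the coordinate convention clearer. It does not use the library; the function below and its gap_chars parameter are hypothetical, and it assumes ‘-’ is the only gap character.

```python
def gap_maps(sequence, gap_chars="-"):
    """Return (ungapped -> gapped, gapped -> ungapped) coordinate dicts."""
    ungapped_to_gapped = {}
    gapped_to_ungapped = {}
    ungapped_pos = 0
    for gapped_pos, char in enumerate(sequence):
        if char in gap_chars:
            continue  # gap positions get no entry in either dict
        ungapped_to_gapped[ungapped_pos] = gapped_pos
        gapped_to_ungapped[gapped_pos] = ungapped_pos
        ungapped_pos += 1
    return ungapped_to_gapped, gapped_to_ungapped


u2g, g2u = gap_maps("-AC--G")
print(u2g)         # {0: 1, 1: 2, 2: 5}
print(g2u)         # {1: 0, 2: 1, 5: 2}
print(g2u.get(0))  # None: position 0 is a gap, so .get() avoids a KeyError
```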

gap_vector(sequence)

Returns list of bool indicating gap or non-gap in sequence.

get_css_style(colors=None, font_size=12, font_family='Lucida Console')

Returns a string of CSS classes and {character: <CSS class name>, …}.

Parameters:
  • colors – {char

  • font_size – in points

  • font_family – name of a monospace font

get_degenerate_positions(sequence, include_gap=True)

Returns indices matching degenerate characters.

get_type()

Return the moltype label.

gettype()

Return the moltype label.

is_ambiguity(querymotif)

Return True if querymotif is an ambiguity character in alphabet.

Parameters:

querymotif – the motif being queried.

is_degenerate(sequence)

Returns True if sequence contains degenerate characters.

is_gap(char)

Returns True if char is a gap.

is_gapped(sequence)

Returns True if sequence contains gaps.

is_strict(sequence)

Returns True if sequence contains only items in self.alphabet.

is_valid(sequence)

Returns True if sequence contains no items that are not in self.

make_array_seq(seq, name=None, **kwargs)

Creates an array sequence.

Parameters:
  • seq – characters or array

  • name (str) –

  • kwargs – keyword arguments for the ArraySequence constructor.

Return type:

ArraySequence

make_seq(seq, name=None, **kwargs)

Returns sequence of correct type.

must_match(first, second)

Returns True if all positions in first must match positions in second.

must_pair(first, second)

Returns True if all positions in first must pair with second.

Pairing occurs in reverse order, i.e. last position of second with first position of first, etc.

mw(sequence, method='random', delta=None)

Returns the molecular weight of the sequence.

If the sequence is ambiguous, uses method (random or strip) to disambiguate the sequence.

If delta is present, uses it instead of the standard weight adjustment.

possibilities(sequence)

Counts number of possible sequences matching the sequence.

Uses self.degenerates to decide how many possibilities there are at each position in the sequence.
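The count is the product of the number of possibilities at each position. A minimal standalone sketch, assuming a hypothetical DEGENERATES table holding only a few IUPAC DNA codes:

```python
from math import prod

# Hypothetical subset of IUPAC DNA ambiguity codes (an assumption).
DEGENERATES = {"R": "AG", "Y": "CT", "N": "ACGT"}


def possibilities(sequence):
    """Count the non-degenerate sequences consistent with `sequence`."""
    # A symbol that is not degenerate contributes exactly one possibility.
    return prod(len(DEGENERATES.get(c, c)) for c in sequence)


print(possibilities("ARN"))  # 8, i.e. 1 * 2 * 4
print(possibilities("ACGT"))  # 1
```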

rc(item)

Returns reverse complement of item, using data from self.complements.

Always returns same type as input.
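For a plain string, the operation amounts to reversing the sequence and complementing each symbol. The sketch below is standalone and illustrative only; COMPLEMENTS is a hypothetical DNA table, not the library's self.complements.

```python
# Hypothetical DNA complement table; the gap character maps to itself.
COMPLEMENTS = {"A": "T", "T": "A", "C": "G", "G": "C", "-": "-"}


def rc(seq):
    """Return the reverse complement of a string."""
    return "".join(COMPLEMENTS[c] for c in reversed(seq))


print(rc("AACG"))      # CGTT
print(rc(rc("AACG")))  # AACG: applying rc twice restores the input
```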

strand_symmetric_motifs(motif_length=1)

Returns ordered pairs of strand-complementary motifs.

to_json()

Returns a JSON-formatted string.

to_regex(seq)

Returns a regex pattern with ambiguities expanded to a character set.
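One way to picture the expansion is the standalone sketch below. The exact pattern the library produces may differ; AMBIGUITIES is a hypothetical table with a few IUPAC DNA codes, and this to_regex is written only for illustration.

```python
import re

# Hypothetical subset of IUPAC DNA ambiguity codes (an assumption).
AMBIGUITIES = {"R": "AG", "Y": "CT", "N": "ACGT"}


def to_regex(seq):
    """Build a regex in which each ambiguity code becomes a character set."""
    return "".join(
        f"[{AMBIGUITIES[c]}]" if c in AMBIGUITIES else re.escape(c)
        for c in seq
    )


pattern = to_regex("ARY")
print(pattern)                               # A[AG][CT]
print(bool(re.fullmatch(pattern, "AGC")))    # True
print(bool(re.fullmatch(pattern, "ACC")))    # False: C is not covered by R
```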

to_rich_dict(for_pickle=False)

valid_on_alphabet(sequence, alphabet=None)

Returns True if sequence contains only items in alphabet.

alphabet can actually be anything that implements __contains__. Defaults to self.alphabet if not supplied.

verify_sequence(seq, gaps_allowed=True, wildcards_allowed=True)

Checks whether sequence is valid on the default alphabet.

Has special-case handling for gaps and wild-cards. This mechanism is probably useful to have in parallel with the validation routines that check specifically whether the sequence has gaps, degenerate symbols, etc., or that explicitly take an alphabet as input.

what_ambiguity(motifs)

Returns the ambiguity code that represents all of ‘motifs’ and as few other motifs as possible.

Does this duplicate DegenerateFromSequence directly?