Miscellaneous functions for manipulating text

Collection of text functions that don’t fit in another category.

Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0: Added isbasestring(), isbytestring(), and isunicodestring() to help tell which string type is which on python2 and python3.

kitchen.text.misc.byte_string_valid_encoding(byte_string, encoding='utf-8')

Detect if a byte string is valid in a specific encoding

Parameters
  • byte_string – Byte string to test for bytes not valid in this encoding

  • encoding – encoding to test against. Defaults to UTF-8.

Returns

True if byte_string contains no byte sequences that are invalid in encoding. False if an invalid sequence is detected.

Note

This function checks whether the byte string is valid in the specified encoding. It does not detect whether the byte string actually was encoded in that encoding. If you want that sort of functionality, you probably want to use guess_encoding() instead.
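The distinction in the note can be illustrated with a minimal sketch. It uses only the standard library and runs on python3; the name is_valid_in_encoding is hypothetical and this is not necessarily how kitchen implements the check.

```python
def is_valid_in_encoding(byte_string, encoding='utf-8'):
    """Return True if byte_string decodes cleanly in the given encoding.

    Hypothetical stand-in for byte_string_valid_encoding(), for illustration.
    """
    try:
        byte_string.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = 'caf\u00e9'.encode('utf-8')      # b'caf\xc3\xa9'
latin1_bytes = 'caf\u00e9'.encode('latin-1')  # b'caf\xe9'

print(is_valid_in_encoding(utf8_bytes, 'utf-8'))     # valid UTF-8
print(is_valid_in_encoding(latin1_bytes, 'utf-8'))   # a lone 0xe9 is not valid UTF-8
# Valid in latin-1 even though the bytes were produced as UTF-8:
print(is_valid_in_encoding(utf8_bytes, 'latin-1'))
```

The last call shows why validity is not the same as detection: every byte sequence decodes as latin-1, so a True result says nothing about how the data was originally encoded.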

kitchen.text.misc.byte_string_valid_xml(byte_string, encoding='utf-8')

Check that a byte string would be valid in xml

Parameters
  • byte_string – Byte string to check

  • encoding – Encoding of the xml file. Default: UTF-8

Returns

True if the string is valid. False if it would be invalid in the xml file

In some cases you’ll have a whole bunch of byte strings and, rather than transforming them to unicode and back to byte strings for output to xml, you will just want to make sure they work with the xml file you’re constructing. This function will help you do that. Example:

ARRAY_OF_MOSTLY_UTF8_STRINGS = [...]
processed_array = []
for string in ARRAY_OF_MOSTLY_UTF8_STRINGS:
    if byte_string_valid_xml(string, 'utf-8'):
        processed_array.append(string)
    else:
        processed_array.append(guess_bytes_to_xml(string, encoding='utf-8'))
output_xml(processed_array)

kitchen.text.misc.guess_encoding(byte_string, disable_chardet=False)

Try to guess the encoding of a byte string

Parameters
  • byte_string – byte string to guess the encoding of

  • disable_chardet – If this is True, we never attempt to use chardet to guess the encoding. This is useful if you need to have reproducibility whether chardet is installed or not. Default: False.

Raises

TypeError – if byte_string is not a byte string type

Returns

string containing a guess at the encoding of byte_string. This is appropriate to pass as the encoding argument when encoding and decoding unicode strings.

We start by attempting to decode the byte string as UTF-8. If this succeeds we tell the world it’s UTF-8 text. If it doesn’t, and chardet is installed on the system and disable_chardet is False, this function will use chardet to try detecting the encoding of byte_string. If chardet is not installed or cannot determine the encoding with a high enough confidence, then we rather arbitrarily claim that it is latin-1. Since every byte value is valid in latin-1, decoding from latin-1 to unicode will not cause UnicodeErrors, although the output might be mangled.
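The fallback chain described above can be sketched roughly as follows. This is an illustration, not kitchen's source: the function name and the threshold parameter (with its 0.9 default) are assumptions, since the text only says the confidence must be "high enough". chardet is treated as optional.

```python
def guess_encoding_sketch(byte_string, disable_chardet=False, threshold=0.9):
    """Guess an encoding: UTF-8 first, then chardet, then latin-1.

    threshold is a hypothetical confidence cutoff chosen for illustration.
    """
    if not isinstance(byte_string, (bytes, bytearray)):
        raise TypeError('byte_string must be a byte string')
    data = bytes(byte_string)

    # Step 1: if it decodes as UTF-8, call it UTF-8.
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass

    # Step 2: ask chardet, if allowed and installed.
    if not disable_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            result = chardet.detect(data)
            if result['encoding'] and result['confidence'] >= threshold:
                return result['encoding'].lower()

    # Step 3: latin-1 can decode any byte sequence, so it is a safe fallback.
    return 'latin-1'


guess = guess_encoding_sketch(b'caf\xe9', disable_chardet=True)
text = b'caf\xe9'.decode(guess)  # never raises when the guess is latin-1
```

Passing disable_chardet=True makes the result depend only on the input bytes, which is the reproducibility case the parameter description mentions.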

kitchen.text.misc.html_entities_unescape(string)

Substitute unicode characters for HTML entities

Parameters

string – unicode string in which to substitute html entities

Raises

TypeError – if something other than a unicode string is given

Return type

unicode string

Returns

The plain text without html entities
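As a rough python3 analogue, the standard library's html.unescape() performs the same kind of substitution for named, decimal, and hexadecimal entities. The wrapper below is a hypothetical sketch and may differ from kitchen's function in edge cases (for example, entities missing their trailing semicolon).

```python
import html


def html_entities_unescape_sketch(string):
    """Replace HTML entities with the characters they stand for.

    Hypothetical stand-in built on html.unescape(); text strings only.
    """
    if not isinstance(string, str):
        raise TypeError('html_entities_unescape_sketch() requires a text string')
    return html.unescape(string)


# Named, decimal, and hexadecimal entities all resolve to characters.
plain = html_entities_unescape_sketch('&lt;b&gt;caf&eacute;&lt;/b&gt; &#233; &#xe9; &amp;')
```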

kitchen.text.misc.isbasestring(obj)

Determine if obj is a byte string or unicode string

In python2 this is equivalent to isinstance(obj, basestring). In python3 it checks whether the object is an instance of str, bytes, or bytearray. This is an aid to porting code that needed to test whether an object was derived from basestring in python2 (commonly used in unicode-bytes conversion functions).

Parameters

obj – Object to test

Returns

True if the object is a basestring. Otherwise False.

New in version kitchen: 1.2.0, API: kitchen.text 2.2.0

kitchen.text.misc.isbytestring(obj)

Determine if obj is a byte string

In python2 this is equivalent to isinstance(obj, str). In python3 it checks whether the object is an instance of bytes or bytearray.

Parameters

obj – Object to test

Returns

True if the object is a byte string. Otherwise, False.

New in version kitchen: 1.2.0, API: kitchen.text 2.2.0

kitchen.text.misc.isunicodestring(obj)

Determine if obj is a unicode string

In python2 this is equivalent to isinstance(obj, unicode). In python3 it checks whether the object is an instance of str.

Parameters

obj – Object to test

Returns

True if the object is a unicode string. Otherwise, False.

New in version kitchen: 1.2.0, API: kitchen.text 2.2.0
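On python3, the three type checks (isbasestring(), isbytestring(), and isunicodestring()) reduce to plain isinstance() tests. The functions below are hypothetical python3-only equivalents written for illustration; the python2 branches are omitted.

```python
def isbasestring_sketch(obj):
    """True for text strings and byte strings alike."""
    return isinstance(obj, (str, bytes, bytearray))


def isbytestring_sketch(obj):
    """True only for byte strings."""
    return isinstance(obj, (bytes, bytearray))


def isunicodestring_sketch(obj):
    """True only for text (unicode) strings."""
    return isinstance(obj, str)


# A typical use: normalize input to text at the edge of a program.
def to_text(value, encoding='utf-8'):
    if isbytestring_sketch(value):
        return bytes(value).decode(encoding, 'replace')
    if isunicodestring_sketch(value):
        return value
    raise TypeError('expected a string type')
```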

kitchen.text.misc.process_control_chars(string, strategy='replace')

Look for and transform control characters in a string

Parameters
  • string – string to search for and transform control characters within

  • strategy

    XML does not allow ASCII control characters. When we encounter those we need to know what to do. Valid options are:

    replace

    (default) Replace the control characters with "?"

    ignore

    Remove the characters altogether from the output

    strict

    Raise a ControlCharError when we encounter a control character

Raises

ControlCharError – if strategy is strict and a control character is encountered

Returns

unicode string with no control characters in it.

Changed in version kitchen: 1.2.0, API: kitchen.text 2.2.0 Strip out the C1 control characters in addition to the C0 control characters.
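The three strategies can be sketched as below. The exact set of characters kitchen treats as control characters is not listed in this section, so the character class here is an assumption: C0 controls other than tab, line feed, and carriage return, plus DEL and the C1 range. ControlCharError is a local stand-in for kitchen's exception of the same name.

```python
import re


class ControlCharError(Exception):
    """Local stand-in for kitchen's ControlCharError."""


# Assumption: C0 controls except \t, \n, \r, plus DEL and C1 (0x7f-0x9f).
_CONTROL_RE = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')


def process_control_chars_sketch(string, strategy='replace'):
    """Replace, drop, or reject control characters in a text string."""
    if not isinstance(string, str):
        raise TypeError('string must be a text string')
    if strategy == 'replace':
        return _CONTROL_RE.sub('?', string)
    if strategy == 'ignore':
        return _CONTROL_RE.sub('', string)
    if strategy == 'strict':
        if _CONTROL_RE.search(string):
            raise ControlCharError('control character found in string')
        return string
    raise ValueError('strategy must be replace, ignore, or strict')


sample = 'a\x00b\tc\x85d'  # NUL (C0), tab (kept), NEL (C1)
replaced = process_control_chars_sketch(sample)
ignored = process_control_chars_sketch(sample, strategy='ignore')
```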

kitchen.text.misc.str_eq(str1, str2, encoding='utf-8', errors='replace')

Compare two strings, converting to byte string if one is unicode

Parameters
  • str1 – First string to compare

  • str2 – Second string to compare

  • encoding – If we need to convert one string into a byte string to compare, the encoding to use. Default is utf-8.

  • errors – What to do if we encounter errors when encoding the string. See the kitchen.text.converters.to_bytes() documentation for possible values. The default is replace.

This function prevents UnicodeError (python-2.4 or less) and UnicodeWarning (python 2.5 and higher) when we compare a unicode string to a byte string. The errors normally arise because the conversion is done to ASCII. This function lets you convert to utf-8 or another encoding instead.

Note

When one of the strings is unicode and the other is a byte string, we convert the unicode string into a byte string before comparing. That means the same pair of strings can compare differently depending on the encoding you pass.

Note that str1 == str2 is faster than this function if you can accept the following limitations:

  • Limited to python-2.5+ (otherwise a UnicodeDecodeError may be thrown)

  • Will generate a UnicodeWarning if a non-ASCII byte string is compared to a unicode string.
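The comparison rule can be sketched on python3 as follows. This is a simplified illustration under the assumption that only a mixed text/bytes pair triggers a conversion; the name str_eq_sketch is hypothetical and kitchen's own implementation may take a different path.

```python
def str_eq_sketch(str1, str2, encoding='utf-8', errors='replace'):
    """Compare two strings, encoding the text one if the types differ."""
    if isinstance(str1, str) and not isinstance(str2, str):
        str1 = str1.encode(encoding, errors)
    elif isinstance(str2, str) and not isinstance(str1, str):
        str2 = str2.encode(encoding, errors)
    return str1 == str2


text = 'caf\u00e9'

# The same text matches different byte strings depending on the encoding.
same_utf8 = str_eq_sketch(text, b'caf\xc3\xa9')                       # UTF-8 bytes
same_latin1 = str_eq_sketch(text, b'caf\xe9', encoding='latin-1')     # latin-1 bytes
mismatch = str_eq_sketch(text, b'caf\xe9')                            # latin-1 bytes vs UTF-8
```

With errors='replace', characters that cannot be encoded become "?", so lossy encodings can make distinct strings compare equal; pass a stricter error handler if that matters.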