Kitchen.text.converters¶
Functions to handle conversion of byte str and unicode strings.
Changed in version kitchen: 0.2a2 ; API kitchen.text 2.0.0
Added getwriter()
Changed in version kitchen: 0.2.2 ; API kitchen.text 2.1.0
Added exception_to_unicode(), exception_to_bytes(), EXCEPTION_CONVERTERS, and BYTE_EXCEPTION_CONVERTERS
Changed in version kitchen: 1.0.1 ; API kitchen.text 2.1.1
Deprecated BYTE_EXCEPTION_CONVERTERS as we’ve simplified exception_to_unicode() and exception_to_bytes() to make it unnecessary
Byte Strings and Unicode in Python2¶
Python2 has two string types, str and unicode. unicode represents an abstract sequence of text characters. It can hold any character that is present in the unicode standard. str can hold any byte of data. The operating system and python work together to display these bytes as characters in many cases, but you should always keep in mind that the information is really a sequence of bytes, not a sequence of characters. In python2 these types are interchangeable much of the time. They are one of the few pairs of types that automatically convert when compared for equality:
>>> # string is converted to unicode and then compared
>>> "I am a string" == u"I am a string"
True
>>> # Other types, like int, don't have this special treatment
>>> 5 == "5"
False
However, this automatic conversion tends to lull people into a false sense of security. As long as you’re dealing with ASCII characters the automatic conversion will save you from seeing any differences. Once you start using characters that are not in ASCII, you will start getting UnicodeError and UnicodeWarning as the automatic conversions between the types fail:
>>> "I am an ñ" == u"I am an ñ"
__main__:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
False
Why do these conversions fail? The reason is that the python2 unicode type represents an abstract sequence of unicode text known as code points. str, on the other hand, really represents a sequence of bytes. Those bytes are converted by your operating system to appear as characters on your screen using a particular encoding (usually with a default defined by the operating system and customizable by the individual user). Although the bytes that represent each ASCII character are fairly standard, the bytes outside of the ASCII range are not. In general, each encoding will map a different character to a particular byte. Newer encodings map individual characters to multiple bytes (which the older encodings will instead treat as multiple characters). In the face of these differences, python refuses to guess at an encoding and instead issues a warning or exception and refuses to convert.
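The ambiguity described above can be reproduced with a few lines of standard-library code. This is a minimal sketch written in Python 3 spelling (byte strings are written b'...'), and the variable names are purely illustrative: the same two bytes decode to one character under UTF-8 and to two unrelated characters under Latin-1.

```python
# Two bytes received from "outside" the program. Nothing in the bytes
# themselves says which encoding produced them.
raw = b"\xc3\xb1"

as_utf8 = raw.decode("utf-8")      # a multi-byte encoding: one character
as_latin1 = raw.decode("latin-1")  # a single-byte encoding: one character per byte

print(len(as_utf8), len(as_latin1))
```

Because both decodings succeed without error, only knowledge of where the data came from can tell you which result is the intended text.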
See also
- Overcoming frustration: Correctly using unicode in python2
For a longer introduction on this subject.
Strategy for Explicit Conversion¶
So what is the best method of dealing with this weltering babble of incoherent encodings? The basic strategy is to explicitly turn everything into unicode when it first enters your program. Then, when you send it to output, you can transform the unicode back into bytes. Doing this allows you to control the encodings that are used and avoid getting tracebacks due to UnicodeError. Using the functions defined in this module, that looks something like this:
1>>> from kitchen.text.converters import to_unicode, to_bytes
2>>> name = raw_input('Enter your name: ')
3Enter your name: Toshio くらとみ
4>>> name
5'Toshio \xe3\x81\x8f\xe3\x82\x89\xe3\x81\xa8\xe3\x81\xbf'
6>>> type(name)
7<type 'str'>
8>>> unicode_name = to_unicode(name)
9>>> type(unicode_name)
10<type 'unicode'>
11>>> unicode_name
12u'Toshio \u304f\u3089\u3068\u307f'
13>>> # Do a lot of other things before needing to save/output again:
14>>> output = open('datafile', 'w')
15>>> output.write(to_bytes(u'Name: %s\n' % unicode_name))
A few notes:
Looking at line 6, you’ll notice that the input we took from the user was a byte str. In general, anytime we’re getting a value from outside of python (the filesystem, reading data from the network, interacting with an external command, reading values from the environment) we are interacting with something that will want to give us a byte str. Some python standard library modules and third party libraries will automatically attempt to convert a byte str to unicode strings for you. This is both a boon and a curse. If the library can guess correctly about the encoding that the data is in, it will return unicode objects to you without you having to convert. However, if it can’t guess correctly, you may end up with one of several problems:
- UnicodeError
The library attempted to decode a byte str into a unicode string, failed, and raised an exception.
- Garbled data
If the library returns the data after decoding it with the wrong encoding, the characters you see in the unicode string won’t be the ones that you expect.
- A byte str instead of a unicode string
Some libraries will return a unicode string when they’re able to decode the data and a byte str when they can’t. This is generally the hardest problem to debug when it occurs. Avoid it in your own code and try to avoid or open bugs against upstreams that do this. See Designing Unicode Aware APIs for strategies to do this properly.
On line 8, we convert from a byte str to a unicode string. to_unicode() does this for us. It has some error handling and sane defaults that make this a nicer function to use than calling str.decode() directly:
- Instead of defaulting to the ASCII encoding, which fails with all but the simple American English characters, it defaults to UTF-8.
- Instead of raising an error if it cannot decode a value, it will replace the value with the unicode “Replacement character” symbol (�).
- If you happen to call this method with something that is not a str or unicode, it will return an empty unicode string.
All three of these can be overridden using different keyword arguments to the function. See the to_unicode() documentation for more information.
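To make those three defaults concrete, here is a minimal sketch of a function with the same behaviour as described in the list above. It is not kitchen's implementation: lenient_decode is a hypothetical name, the code uses only the standard library, and it is written in Python 3 spelling, where bytes is the byte string type and str is the text type.

```python
def lenient_decode(value, encoding="utf-8", errors="replace"):
    """Hypothetical stand-in that mimics the defaults described above."""
    if isinstance(value, str):
        # Already text: hand it back untouched.
        return value
    if isinstance(value, (bytes, bytearray)):
        # UTF-8 by default; undecodable bytes become U+FFFD instead of raising.
        return bytes(value).decode(encoding, errors)
    # Anything that is not a string yields an empty text string.
    return ""


print(repr(lenient_decode(b"caf\xc3\xa9")))
print(repr(lenient_decode(b"\xff")))
```

Passing errors="strict" to the sketch restores the usual exception, which mirrors how the real function lets you override each default with a keyword argument.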
On line 15 we push the data back out to a file. Two things you should note here:
- We deal with the strings as unicode until the last instant. The string format that we’re using is unicode and the variable also holds unicode. People sometimes get into trouble when they mix a byte str format with a variable that holds a unicode string (or vice versa) at this stage.
- to_bytes() does the reverse of to_unicode(). In this case, we’re using the default values which turn unicode into a byte str using UTF-8. Any errors are replaced with a � and sending nonstring objects yields empty unicode strings. Just like to_unicode(), you can look at the documentation for to_bytes() to find out how to override any of these defaults.
When to use an alternate strategy¶
The default strategy of decoding to unicode strings when you take data in and encoding to a byte str when you send the data back out works great for most problems but there are a few times when you shouldn’t:
- The values aren’t meant to be read as text.
- The values need to be byte-for-byte when you send them back out, for instance if they are database keys or filenames.
- You are transferring the data between several libraries that all expect byte str.
In each of these instances, there is a reason to keep around the byte str version of a value. Here are a few hints to keep your sanity in these situations:
- Keep your unicode and str values separate. Just like the pain caused when you have to use someone else’s library that returns both unicode and str, you can cause yourself pain if you have functions that can return both types or variables that could hold either type of value.
- Name your variables so that you can tell whether you’re storing a byte str or a unicode string. One of the first things you end up having to do when debugging is determine what type of string you have in a variable and what type of string you are expecting. Naming your variables consistently so that you can tell which type they are supposed to hold will save you from at least one of those steps.
- When you get values initially, make sure that you’re dealing with the type of value that you expect as you save it. You can use isinstance() or to_bytes() since to_bytes() doesn’t do any modifications of the string if it’s already a str. When using to_bytes() for this purpose you might want to use:
try:
    b_input = to_bytes(input_should_be_bytes_already, errors='strict', nonstring='strict')
except (TypeError, UnicodeError):
    handle_errors_somehow()
The reason is that the default of to_bytes() will take characters that are illegal in the chosen encoding and transform them to replacement characters. Since the point of keeping this data as a byte str is to keep the exact same bytes when you send it outside of your code, changing things to replacement characters should be raising red flags that something is wrong. Setting errors to strict will raise an exception which gives you an opportunity to fail gracefully.
- Sometimes you will want to print out the values that you have in your byte str. When you do this you will need to make sure that you transform unicode to str before combining them. Also be sure that any other function calls (including gettext) are going to give you strings that are the same type. For instance:
print to_bytes(_('Username: %(user)s'), 'utf-8') % {'user': b_username}
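The warning about replacement characters can be checked directly. The following is a small standard-library sketch in Python 3 spelling; the filename and the require_bytes helper are hypothetical and only illustrate why values that must survive byte-for-byte should never pass through a lenient decode.

```python
original = b"report-\xff.txt"   # a filename that is not valid UTF-8

# Lenient decoding swaps the undecodable byte for U+FFFD ...
lossy_text = original.decode("utf-8", "replace")
# ... so encoding the text again produces different bytes than we started with.
round_tripped = lossy_text.encode("utf-8")


def require_bytes(value):
    """Hypothetical guard: pass bytes through untouched, reject anything else."""
    if not isinstance(value, bytes):
        raise TypeError("expected a byte string, got %s" % type(value).__name__)
    return value


print(round_tripped == original)
```

A guard of this kind fails loudly at the point where the wrong type enters, which is the same effect the strict arguments have in the to_bytes() call shown in the list above.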
Gotchas and how to avoid them¶
Even when you have a good conceptual understanding of how python2 treats unicode and str there are still some things that can surprise you. In most cases this is because, as noted earlier, python or one of the python libraries you depend on is trying to convert a value automatically and failing. Explicit conversion at the appropriate place usually solves that.
str(obj)¶
One common idiom for getting a simple, string representation of an object is to use:
str(obj)
Unfortunately, this is not safe. Sometimes str(obj) will return unicode. Sometimes it will return a byte str. Sometimes, it will attempt to convert from a unicode string to a byte str, fail, and throw a UnicodeError. To be safe from all of these, first decide whether you need unicode or str to be returned. Then use to_unicode() or to_bytes() to get the simple representation like this:
u_representation = to_unicode(obj, nonstring='simplerepr')
b_representation = to_bytes(obj, nonstring='simplerepr')
print¶
python has a builtin print() statement that outputs strings to the terminal. This originated in a time when python only dealt with byte str. When unicode strings came about, some enhancements were made to the print() statement so that it could print those as well. The enhancements make print() work most of the time. However, the times when it doesn’t work tend to make for cryptic debugging.
The basic issue is that print() has to figure out what encoding to use when it prints a unicode string to the terminal. When python is attached to your terminal (i.e., you’re running the interpreter or running a script that prints to the screen) python is able to take the encoding value from your locale settings LC_ALL or LC_CTYPE and print the characters allowed by that encoding. On most modern Unix systems, the encoding is utf-8 which means that you can print any unicode character without problem.
There are two common cases of things going wrong:
- Someone has a locale set that does not accept all valid unicode characters. For instance:
$ LC_ALL=C python
>>> print u'\ufffd'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
This often happens when a script that you’ve written and debugged from the terminal is run from an automated environment like cron. It also occurs when you have written a script using a utf-8 aware locale and released it for consumption by people all over the internet. Inevitably, someone is running with a locale that can’t handle all unicode characters and you get a traceback reported.
- You redirect output to a file. Python isn’t using the values in LC_ALL unconditionally to decide what encoding to use. Instead it is using the encoding set for the terminal you are printing to, which is set to accept different encodings by LC_ALL. If you redirect to a file, you are no longer printing to the terminal so LC_ALL won’t have any effect. At this point, python will decide it can’t find an encoding and fall back to ASCII, which will likely lead to UnicodeError being raised. You can see this in a short script:
#! /usr/bin/python -tt
print u'\ufffd'
And then look at the difference between running it normally and redirecting to a file:
$ ./test.py
�
$ ./test.py > t
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    print u'\ufffd'
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufffd' in position 0: ordinal not in range(128)
The short answer to dealing with this is to always use bytes when writing output. You can do this by explicitly converting to bytes like this:
from kitchen.text.converters import to_bytes
u_string = u'\ufffd'
print to_bytes(u_string)
or you can wrap stdout and stderr with a StreamWriter. A StreamWriter is convenient in that you can put it in place of sys.stdout or sys.stderr and then have output encoded automatically, but it has the drawback of still being able to throw UnicodeError if the writer can’t encode all possible unicode codepoints. Kitchen provides an alternate version which can be retrieved with kitchen.text.converters.getwriter() which will not traceback in its standard configuration.
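For comparison, the standard library's own writer can be made tolerant by choosing a non-strict error handler when it is instantiated. This is a sketch in Python 3 spelling using only codecs and io; the io.BytesIO object is an assumption standing in for a byte-oriented stream such as a redirected stdout, and the code does not use kitchen.

```python
import codecs
import io

# io.BytesIO stands in for a byte-oriented output stream.
stream = io.BytesIO()

# codecs.getwriter() returns a StreamWriter class. Passing errors="replace"
# at instantiation avoids UnicodeEncodeError for unencodable characters.
writer = codecs.getwriter("ascii")(stream, errors="replace")
writer.write(u"caf\xe9\n")

output = stream.getvalue()
print(output)
```

With the default errors="strict" the same write would raise UnicodeEncodeError, which is the drawback of the plain StreamWriter mentioned above.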
Unicode, str, and dict keys¶
The hash() of the ASCII characters is the same for unicode and byte str. When you use them in dict keys, they evaluate to the same dictionary slot:
>>> u_string = u'a'
>>> b_string = 'a'
>>> hash(u_string), hash(b_string)
(12416037344, 12416037344)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'a': 'bytes'}
When you deal with key values outside of ASCII, unicode and byte str evaluate unequally no matter what their character content or hash value:
>>> u_string = u'ñ'
>>> b_string = u_string.encode('utf-8')
>>> print u_string
ñ
>>> print b_string
ñ
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xc3\xb1': 'bytes'}
>>> b_string2 = '\xf1'
>>> hash(u_string), hash(b_string2)
(30848092528, 30848092528)
>>> d = {}
>>> d[u_string] = 'unicode'
>>> d[b_string2] = 'bytes'
>>> d
{u'\xf1': 'unicode', '\xf1': 'bytes'}
How do you work with this one? Remember rule #1: Keep your unicode and byte str values separate. That goes for keys in a dictionary just like anything else.
- For any given dictionary, make sure that all your keys are either unicode or str. Do not mix the two. If you’re being given both unicode and str but you don’t need to preserve separate keys for each, I recommend using to_unicode() or to_bytes() to convert all keys to one type or the other like this:
>>> from kitchen.text.converters import to_unicode
>>> u_string = u'one'
>>> b_string = 'two'
>>> d = {}
>>> d[to_unicode(u_string)] = 1
>>> d[to_unicode(b_string)] = 2
>>> d
{u'two': 2, u'one': 1}
- These issues also apply to using dicts with tuple keys that contain a mixture of unicode and str. Once again the best fix is to standardise on either str or unicode.
- If you absolutely need to store values in a dictionary where the keys could be either unicode or str you can use StrictDict which has separate entries for all unicode and byte str and deals correctly with any tuple containing mixed unicode and byte str.
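The advice to standardise on one key type can be applied to an existing dictionary in a single pass. The helper below is a hypothetical sketch using only the standard library, in Python 3 spelling (where byte and text keys never compare equal); it assumes the byte keys are UTF-8 unless told otherwise.

```python
def text_keys(mapping, encoding="utf-8"):
    """Return a new dict whose keys are all text strings (hypothetical helper)."""
    normalized = {}
    for key, value in mapping.items():
        if isinstance(key, bytes):
            # Decode strictly: a key that cannot be decoded is a real error.
            key = key.decode(encoding)
        normalized[key] = value
    return normalized


mixed = {u"one": 1, b"two": 2}
clean = text_keys(mixed)
print(sorted(clean))
```

Decoding strictly is a deliberate choice here: silently replacing bytes in a key could make two distinct keys collide.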
Functions¶
Unicode and byte str conversion¶
- kitchen.text.converters.to_unicode(obj, encoding='utf-8', errors='replace', nonstring=None, non_string=None)¶
Convert an object into a str string
- Parameters
obj – Object to convert to a str string. This should normally be a byte bytes
encoding – What encoding to try converting the byte bytes as. Defaults to utf-8
errors – If errors are found while decoding, perform this action. Defaults to replace which replaces the invalid bytes with a character that means the bytes were unable to be decoded. Other values are the same as the error handling schemes in the codec base classes. For instance strict which raises an exception and ignore which simply omits the non-decodable characters.
nonstring – How to treat nonstring values. Possible values are:
- simplerepr
Attempt to call the object’s “simple representation” method and return that value. Python-2.3+ has two methods that try to return a simple representation: object.__unicode__() and object.__str__(). We first try to get a usable value from object.__unicode__(). If that fails we try the same with object.__str__().
- empty
Return an empty str string
- strict
Raise a TypeError
- passthru
Return the object unchanged
- repr
Attempt to return a str string of the repr of the object
Default is simplerepr
non_string – Deprecated. Use nonstring instead
- Raises
TypeError – if nonstring is strict and a non-basestring object is passed in or if nonstring is set to an unknown value
UnicodeDecodeError – if errors is strict and obj is not decodable using the given encoding
- Returns
str string or the original object depending on the value of nonstring.
Usually this should be used on a byte bytes but it can take both byte bytes and str strings intelligently. Nonstring objects are handled in different ways depending on the setting of the nonstring parameter.
The default values of this function are set so as to always return a str string and never raise an error when converting from a byte bytes to a str string. However, when you do not pass validly encoded text (or a nonstring object), you may end up with output that you don’t expect. Be sure you understand the requirements of your data, not just ignore errors by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed default value to simplerepr
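The three error handlers named for the errors parameter are the standard codec error handling schemes, so their effect can be previewed with the built-in decode method. This sketch uses only the standard library in Python 3 spelling and does not call to_unicode() itself.

```python
data = b"caf\xff!"   # \xff can never appear in valid UTF-8

replaced = data.decode("utf-8", "replace")  # undecodable byte becomes U+FFFD
ignored = data.decode("utf-8", "ignore")    # undecodable byte is dropped

try:
    data.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True                    # strict raises instead of guessing

print(repr(replaced), repr(ignored), strict_failed)
```

Note that ignore loses information silently, while replace at least leaves a visible marker where the data was damaged.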
- kitchen.text.converters.to_bytes(obj, encoding='utf-8', errors='replace', nonstring=None, non_string=None)¶
Convert an object into a byte bytes
- Parameters
obj – Object to convert to a byte bytes. This should normally be a str string.
encoding – Encoding to use to convert the str string into a byte bytes. Defaults to utf-8.
errors – If errors are found while encoding, perform this action. Defaults to replace which replaces the invalid bytes with a character that means the bytes were unable to be encoded. Other values are the same as the error handling schemes in the codec base classes. For instance strict which raises an exception and ignore which simply omits the non-encodable characters.
nonstring – How to treat nonstring values. Possible values are:
- simplerepr
Attempt to call the object’s “simple representation” method and return that value. Python-2.3+ has two methods that try to return a simple representation: object.__unicode__() and object.__str__(). We first try to get a usable value from object.__str__(). If that fails we try the same with object.__unicode__().
- empty
Return an empty byte bytes
- strict
Raise a TypeError
- passthru
Return the object unchanged
- repr
Attempt to return a byte bytes of the repr() of the object
Default is simplerepr.
non_string – Deprecated. Use nonstring instead.
- Raises
TypeError – if nonstring is strict and a non-basestring object is passed in or if nonstring is set to an unknown value.
UnicodeEncodeError – if errors is strict and all of the bytes of obj are unable to be encoded using encoding.
- Returns
byte bytes or the original object depending on the value of nonstring.
Warning
If you pass a byte bytes into this function the byte bytes is returned unmodified. It is not re-encoded with the specified encoding. The easiest way to achieve that is:
to_bytes(to_unicode(text), encoding='utf-8')
The initial to_unicode() call will ensure text is a str string. Then, to_bytes() will turn that into a byte bytes with the specified encoding.
Usually, this should be used on a str string but it can take either a byte bytes or a str string intelligently. Nonstring objects are handled in different ways depending on the setting of the nonstring parameter.
The default values of this function are set so as to always return a byte bytes and never raise an error when converting from unicode to bytes. However, when you do not pass an encoding that can validly encode the object (or a non-string object), you may end up with output that you don’t expect. Be sure you understand the requirements of your data, not just ignore errors by passing it through this function.
Changed in version 0.2.1a2: Deprecated non_string in favor of nonstring parameter and changed default value to simplerepr
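The decode-then-encode pattern from the warning above can be sketched with the built-in string methods. The reencode helper is hypothetical and uses only the standard library (Python 3 spelling); unlike the to_bytes(to_unicode(text)) idiom with its lenient defaults, this sketch decodes strictly, so it raises if the source encoding is wrong.

```python
def reencode(data, from_encoding, to_encoding="utf-8"):
    """Hypothetical helper: transcode a byte string by going through text."""
    return data.decode(from_encoding).encode(to_encoding)


latin1_bytes = b"caf\xe9"
utf8_bytes = reencode(latin1_bytes, "latin-1")
print(utf8_bytes)
```

The important point is that a byte string carries no record of its encoding, so the source encoding always has to be supplied by the caller.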
- kitchen.text.converters.getwriter(encoding)¶
Return a codecs.StreamWriter that resists tracing back.
- Parameters
encoding – Encoding to use for transforming str strings into byte bytes.
- Return type
codecs.StreamWriter
- Returns
StreamWriter that you can instantiate to wrap output streams to automatically translate str strings into encoding.
This is a reimplementation of codecs.getwriter() that returns a StreamWriter that resists issuing tracebacks. The StreamWriter that is returned uses kitchen.text.converters.to_bytes() to convert str strings into byte bytes. The departures from codecs.getwriter() are:
- The StreamWriter that is returned will take byte bytes as well as str strings. Any byte bytes will be passed through unmodified.
- The default error handler for unknown bytes is to replace the bytes with the unknown character (? in most ascii-based encodings, � in the utf encodings) whereas codecs.getwriter() defaults to strict. Like codecs.StreamWriter, the returned StreamWriter can have its error handler changed in code by setting stream.errors = 'new_handler_name'
Example usage:
$ LC_ALL=C python
>>> import sys
>>> from kitchen.text.converters import getwriter
>>> UTF8Writer = getwriter('utf-8')
>>> unwrapped_stdout = sys.stdout
>>> sys.stdout = UTF8Writer(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
café
>>> ASCIIWriter = getwriter('ascii')
>>> sys.stdout = ASCIIWriter(unwrapped_stdout)
>>> print 'caf\xc3\xa9'
café
>>> print u'caf\xe9'
caf?
See also
API docs for codecs.StreamWriter and codecs.getwriter() and Print Fails on the python wiki.
New in version kitchen: 0.2a2, API: kitchen.text 1.1.0
- kitchen.text.converters.to_str(obj)¶
Deprecated
This function converts something to a byte bytes if it isn’t one. It’s used to call str() or unicode() on the object to get its simple representation without danger of getting a UnicodeError. You should be using to_unicode() or to_bytes() explicitly instead.
If you need str strings:
to_unicode(obj, nonstring='simplerepr')
If you need byte bytes:
to_bytes(obj, nonstring='simplerepr')
- kitchen.text.converters.to_utf8(obj, errors='replace', non_string='passthru')¶
Deprecated
Convert str to an encoded utf-8 byte bytes. You should be using to_bytes() instead:
to_bytes(obj, encoding='utf-8', non_string='passthru')
Transformation to XML¶
- kitchen.text.converters.unicode_to_xml(string, encoding='utf-8', attrib=False, control_chars='replace')¶
Take a str string and turn it into a byte bytes suitable for xml
- Parameters
string – str string to encode into an XML compatible byte bytes
encoding – encoding to use for the returned byte bytes. Default is to encode to UTF-8. If some of the characters in string are not encodable in this encoding, the unknown characters will be entered into the output string using xml character references.
attrib – If True, quote the string for use in an xml attribute. If False (default), quote for use in an xml text field.
control_chars – control characters are not allowed in XML documents. When we encounter those we need to know what to do. Valid options are:
- replace
(default) Replace the control characters with ?
- ignore
Remove the characters altogether from the output
- strict
Raise an XmlEncodeError when we encounter a control character
- Raises
kitchen.text.exceptions.XmlEncodeError – If control_chars is set to strict and the string to be made suitable for output to xml contains control characters or if string is not a str string then we raise this exception.
ValueError – If control_chars is set to something other than replace, ignore, or strict.
- Return type
byte bytes
- Returns
representation of the str string as a valid XML byte bytes
XML files consist mainly of text encoded using a particular charset. XML also denies the use of certain bytes in the encoded text (example: ASCII Null). There are also special characters that must be escaped if they are present in the input (example: <). This function takes care of all of those issues for you.
There are a few different ways to use this function depending on your needs. The simplest invocation is like this:
unicode_to_xml(u'String with non-ASCII characters: <"á と">')
This will return the following to you, encoded in utf-8:
'String with non-ASCII characters: &lt;"á と"&gt;'
Pretty straightforward. Now, what if you need to encode your document in something other than utf-8? For instance, latin-1? Let’s see:
unicode_to_xml(u'String with non-ASCII characters: <"á と">', encoding='latin-1')
'String with non-ASCII characters: &lt;"á &#12392;"&gt;'
Because the と character is not available in the latin-1 charset, it is replaced with &#12392; in our output. This is an xml character reference which represents the character at unicode codepoint 12392, the と character.
When you want to reverse this, use xml_to_unicode() which will turn a byte bytes into a str string and replace the xml character references with the unicode characters.
XML also has the quirk of not allowing control characters in its output. The control_chars parameter allows us to specify what to do with those. For use cases that don’t need absolute character by character fidelity (example: holding strings that will just be used for display in a GUI app later), the default value of replace works well:
unicode_to_xml(u'String with disallowed control chars: \u0000\u0007')
'String with disallowed control chars: ??'
If you do need to be able to reproduce all of the characters at a later date (examples: if the string is a key value in a database or a path on a filesystem) you have many choices. Here are a few that rely on utf-7, a verbose encoding that encodes control characters (as well as non-ASCII unicode values) to characters from within the ASCII printable characters. The good thing about doing this is that the code is pretty simple. You just need to use utf-7 both when encoding the field for xml and when decoding it for use in your python program:
unicode_to_xml(u'String with unicode: と and control char: \u0007', encoding='utf7')
'String with unicode: +MGg and control char: +AAc-'
# [...]
xml_to_unicode('String with unicode: +MGg and control char: +AAc-', encoding='utf7')
u'String with unicode: と and control char: \u0007'
As you can see, the utf-7 encoding will transform even characters that would be representable in utf-8. This can be a drawback if you want unicode characters in the file to be readable without being decoded first. You can work around this with increased complexity in your application code:
from kitchen.text.exceptions import XmlEncodeError

encoding = 'utf-8'
u_string = u'String with unicode: と and control char: \u0007'
try:
    # First attempt to encode to utf8, failing if control characters are present
    data = unicode_to_xml(u_string, encoding=encoding, control_chars='strict')
except XmlEncodeError:
    # Fallback to utf-7
    encoding = 'utf-7'
    data = unicode_to_xml(u_string, encoding=encoding, control_chars='strict')
# write_tag() and tag stand in for your application's own XML handling
write_tag('<mytag encoding=%s>%s</mytag>' % (encoding, data))
# [...]
encoding = tag.attributes.encoding
u_string = xml_to_unicode(u_string, encoding=encoding)
Using code similar to that, you can have some fields encoded using your default encoding and fall back to utf-7 if there are control characters present.
Note
If your goal is to preserve the control characters you cannot save the entire file as utf-7 and set the xml encoding parameter to utf-7. Because XML doesn’t allow control characters, you have to encode those separately from any encoding work that the XML parser itself knows about.
See also
bytes_to_xml()
if you’re dealing with bytes that are non-text or of an unknown encoding that you must preserve on a byte for byte level.
guess_encoding_to_xml()
if you’re dealing with strings in unknown encodings that you don’t need to save with char-for-char fidelity.
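The escaping and character-reference behaviour described for unicode_to_xml() can be approximated with the standard library. This is only a sketch of the general technique in Python 3 spelling, not kitchen's implementation: it combines xml.sax.saxutils.escape with the xmlcharrefreplace error handler and does nothing about control characters or attribute quoting.

```python
from xml.sax.saxutils import escape

text = u'<"\xe1 \u3068">'

# escape() handles &, < and >. The xmlcharrefreplace handler turns characters
# that the target encoding cannot represent into numeric character references.
as_latin1 = escape(text).encode("latin-1", "xmlcharrefreplace")
as_utf8 = escape(text).encode("utf-8", "xmlcharrefreplace")

print(as_latin1)
print(as_utf8)
```

Because UTF-8 can represent every code point, the character reference only appears in the latin-1 result.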
- kitchen.text.converters.xml_to_unicode(byte_string, encoding='utf-8', errors='replace')¶
Transform a byte bytes from an xml file into a str string
- Parameters
byte_string – byte bytes to decode
encoding – encoding that the byte bytes is in
errors – What to do if not every character is valid in encoding. See the to_unicode() documentation for legal values.
- Return type
str string
- Returns
string decoded from byte_string
This function attempts to reverse what unicode_to_xml() does. It takes a byte bytes (presumably read in from an xml file) and expands all the html entities into unicode characters and decodes the byte bytes into a str string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml() instead.
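The reverse direction can be approximated with the standard library as well. The sketch below, in Python 3 spelling, decodes the bytes and then expands entities with html.unescape; this is an assumption about the general approach rather than kitchen's code, and html.unescape recognises the full set of HTML named entities, not only the five that XML predefines.

```python
import html

raw = b'&lt;"\xe1 &#12392;"&gt;'   # latin-1 bytes as they might appear in a file

# Decode the bytes first, then expand character and entity references.
decoded = html.unescape(raw.decode("latin-1", "replace"))

print(repr(decoded))
```

As the description above notes, nothing in this step can bring back control characters that were replaced or removed when the data was written.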
- kitchen.text.converters.byte_string_to_xml(byte_string, input_encoding='utf-8', errors='replace', output_encoding='utf-8', attrib=False, control_chars='replace')¶
Make sure a byte bytes is validly encoded for xml output
- Parameters
byte_string – Byte bytes to turn into valid xml output
input_encoding – Encoding of byte_string. Default utf-8
errors – How to handle errors encountered while decoding the byte_string into str at the beginning of the process. Values are:
- replace
(default) Replace the invalid bytes with a ?
- ignore
Remove the characters altogether from the output
- strict
Raise a UnicodeDecodeError when we encounter a non-decodable character
output_encoding – Encoding for the xml file that this string will go into. Default is utf-8. If all the characters in byte_string are not encodable in this encoding, the unknown characters will be entered into the output string using xml character references.
attrib – If True, quote the string for use in an xml attribute. If False (default), quote for use in an xml text field.
control_chars – XML does not allow control characters. When we encounter those we need to know what to do. Valid options are:
- replace
(default) Replace the control characters with ?
- ignore
Remove the characters altogether from the output
- strict
Raise an error when we encounter a control character
- Raises
XmlEncodeError – If control_chars is set to strict and the string to be made suitable for output to xml contains control characters then we raise this exception.
UnicodeDecodeError – If errors is set to strict and the byte_string contains bytes that are not decodable using input_encoding, this error is raised
- Return type
byte bytes
- Returns
representation of the byte bytes in the output encoding with any bytes that aren’t available in xml taken care of.
Use this when you have a byte bytes representing text that you need to make suitable for output to xml. There are several situations where this applies. For instance, if you need to transform some strings encoded in latin-1 to utf-8 for output:
utf8_string = byte_string_to_xml(latin1_string, input_encoding='latin-1')
If you already have strings in the proper encoding you may still want to use this function to remove control characters:
cleaned_string = byte_string_to_xml(string, input_encoding='utf-8', output_encoding='utf-8')
See also
unicode_to_xml()
for other ideas on using this function
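The two uses just described, transcoding and stripping control characters, can be sketched together with the standard library. clean_for_xml is a hypothetical function in Python 3 spelling, not kitchen's implementation, and it assumes the control characters of interest are the C0 characters that XML 1.0 disallows (everything below U+0020 except tab, newline and carriage return).

```python
import re
from xml.sax.saxutils import escape

# C0 control characters that XML 1.0 does not allow in documents.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_xml(data, input_encoding="utf-8", output_encoding="utf-8"):
    """Hypothetical sketch: decode, replace control chars, escape, re-encode."""
    text = data.decode(input_encoding, "replace")
    text = _CONTROL_CHARS.sub("?", text)
    return escape(text).encode(output_encoding, "xmlcharrefreplace")


print(clean_for_xml(b"caf\xe9 <\x07>", input_encoding="latin-1"))
```

Replacing rather than removing the control characters keeps the length of the text recognisable, which matches the replace default documented above.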
- kitchen.text.converters.xml_to_byte_string(byte_string, input_encoding='utf-8', errors='replace', output_encoding='utf-8')¶
Transform a byte string from an xml file into a str string
- Parameters
byte_string – byte string to decode
input_encoding – encoding that the byte string is in
errors – What to do if not every character is valid in encoding. See the to_unicode() docstring for legal values.
output_encoding – Encoding for the output byte string
- Returns
str string decoded from byte_string
This function attempts to reverse what unicode_to_xml() does. It takes a byte string (presumably read in from an xml file), expands all the html entities into unicode characters, and decodes the byte string into a str string. One thing it cannot do is restore any control characters that were removed prior to inserting into the file. If you need to keep such characters you need to use xml_to_bytes() and bytes_to_xml() or use one of the strategies documented in unicode_to_xml() instead.
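As a rough illustration of the decode-and-expand step described above, the following sketch uses only the standard library (html.unescape needs Python 3.4 or later). sketch_xml_to_text is a hypothetical name and this is not kitchen's implementation. It also shows why control characters that were replaced on the way in cannot be recovered.

```python
from html import unescape


def sketch_xml_to_text(byte_string, input_encoding='utf-8', errors='replace'):
    """Hypothetical helper: decode the bytes, then expand entities."""
    text = byte_string.decode(input_encoding, errors)
    # unescape() expands named entities (&amp;) and numeric character
    # references (&#233;) into the characters they stand for.
    return unescape(text)


restored = sketch_xml_to_text(b'caf&#233; &amp; &lt;tag&gt;')

# A bell character that was replaced with "?" when the file was written is
# indistinguishable from a literal question mark when it is read back.
lossy = sketch_xml_to_text(b'ring?')
```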
- kitchen.text.converters.bytes_to_xml(byte_string, *args, **kwargs)¶
Return a byte string encoded so it is valid inside of any xml file
- Parameters
byte_string – byte string to transform
*args, **kwargs – extra arguments to this function are passed on to the function actually implementing the encoding. You can use this to tweak the output in some cases but, as a general rule, you shouldn’t because the underlying encoding function is not guaranteed to remain the same.
- Return type
bytes consisting of all ASCII characters
- Returns
byte string representation of the input. This will be encoded using base64.
This function is made especially to put binary information into xml documents.
This function is intended for encoding things that must be preserved byte-for-byte. If you want to encode a byte string that’s text and don’t mind losing the actual bytes you probably want to try byte_string_to_xml() or guess_encoding_to_xml() instead.
Note
Although the current implementation uses base64.b64encode() and there are no plans to change it, that isn’t guaranteed. If you want to make sure that you can encode and decode these messages it’s best to use xml_to_bytes() if you use this function to encode.
- kitchen.text.converters.xml_to_bytes(byte_string, *args, **kwargs)¶
Decode a string encoded using bytes_to_xml()
- Parameters
byte_string – byte string to transform. This should be a base64 encoded sequence of bytes originally generated by bytes_to_xml().
*args, **kwargs – extra arguments to this function are passed on to the function actually implementing the encoding. You can use this to tweak the output in some cases but, as a general rule, you shouldn’t because the underlying encoding function is not guaranteed to remain the same.
- Return type
bytes
- Returns
byte string that’s the decoded input
If you’ve got fields in an xml document that were encoded with bytes_to_xml() then you want to use this function to decode them. It converts a base64 encoded string into a byte string.
Note
Although the current implementation uses base64.b64decode() and there are no plans to change it, that isn’t guaranteed. If you want to make sure that you can encode and decode these messages it’s best to use bytes_to_xml() if you use this function to decode.
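The byte-for-byte round trip that bytes_to_xml() and xml_to_bytes() provide can be sketched with the base64 module directly. The sketch_ functions below are hypothetical stand-ins that assume the current base64 behaviour noted above. Real code should call the kitchen pair so that a future change of encoding stays transparent.

```python
import base64


def sketch_bytes_to_xml(byte_string):
    """Hypothetical stand-in: base64 output is plain ASCII, so it is xml-safe."""
    return base64.b64encode(byte_string)


def sketch_xml_to_bytes(byte_string):
    """Hypothetical stand-in: reverse of sketch_bytes_to_xml()."""
    return base64.b64decode(byte_string)


# Binary data with control characters and bytes that are not valid utf-8.
raw = b'\x00\x07binary \xff payload'
encoded = sketch_bytes_to_xml(raw)
decoded = sketch_xml_to_bytes(encoded)
```

Unlike byte_string_to_xml(), nothing is replaced or dropped here, so the decoded value is identical to the input.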
- kitchen.text.converters.guess_encoding_to_xml(string, output_encoding='utf-8', attrib=False, control_chars='replace')¶
Return a byte string suitable for inclusion in xml
- Parameters
string – str or byte string to be transformed into a byte string suitable for inclusion in xml. If string is a byte string we attempt to guess the encoding. If we cannot guess, we fall back to latin-1.
output_encoding – Output encoding for the byte string. This should match the encoding of your xml file.
attrib – If True, escape the item for use in an xml attribute. If False (default) escape the item for use in a text node.
- Returns
utf-8 encoded byte string
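The guess-then-fall-back behaviour described for the string parameter might look roughly like the sketch below. It rests on loud assumptions: "guessing" is reduced to trying utf-8 first, attribute quoting is done with xml.sax.saxutils.quoteattr(), control character handling is left out, and sketch_guess_encoding_to_xml is a hypothetical name. kitchen's actual guessing logic may be more involved.

```python
from xml.sax.saxutils import escape, quoteattr


def sketch_guess_encoding_to_xml(string, output_encoding='utf-8',
                                 attrib=False):
    """Hypothetical helper: accept str or bytes, return xml-safe bytes."""
    if isinstance(string, bytes):
        try:
            # Assumption: the "guess" is simply "does it decode as utf-8?"
            text = string.decode('utf-8')
        except UnicodeDecodeError:
            # latin-1 maps every byte to a character, so this cannot fail.
            text = string.decode('latin-1')
    else:
        text = string
    quoted = quoteattr(text) if attrib else escape(text)
    return quoted.encode(output_encoding, 'xmlcharrefreplace')


from_latin1 = sketch_guess_encoding_to_xml(b'caf\xe9')
from_utf8 = sketch_guess_encoding_to_xml('caf\xe9'.encode('utf-8'))
as_attribute = sketch_guess_encoding_to_xml('a<b', attrib=True)
```

Because latin-1 can decode any byte sequence, the fallback never raises, but it can silently produce the wrong characters when the input was in some other encoding.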
- kitchen.text.converters.to_xml(string, encoding='utf-8', attrib=False, control_chars='ignore')¶
Deprecated: Use guess_encoding_to_xml() instead
Working with exception messages¶
- kitchen.text.converters.EXCEPTION_CONVERTERS = (<function <lambda>>, <function <lambda>>)¶
Tuple of functions to try to use to convert an exception into a string representation. Its main use is to extract a string (str or bytes) from an exception object in exception_to_unicode() and exception_to_bytes(). The functions here will try the exception’s args[0] and the exception itself (roughly equivalent to str(exception)) to extract the message. This is only a default and can be easily overridden when calling those functions. There are several reasons you might wish to do that. If you have exceptions where the best string representing the exception is not returned by the default functions, you can add another function to extract from a different field:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
        exception_to_unicode)

class MyError(Exception):
    def __init__(self, message):
        self.value = message

# Try the custom field first, then fall back to the defaults
c = [lambda e: e.value]
c.extend(EXCEPTION_CONVERTERS)
try:
    raise MyError('An Exception message')
except MyError as e:
    print(exception_to_unicode(e, converters=c))
Another reason would be if you’re converting to a byte string and you know the bytes need to be in a non-utf-8 encoding. exception_to_bytes() defaults to utf-8 but if you convert into a byte string explicitly using a converter then you can choose a different encoding:
from kitchen.text.converters import (EXCEPTION_CONVERTERS,
        exception_to_bytes, to_bytes)

# Encode to euc_jp before exception_to_bytes() applies its utf-8 default
c = [lambda e: to_bytes(e.args[0], encoding='euc_jp'),
        lambda e: to_bytes(e, encoding='euc_jp')]
c.extend(EXCEPTION_CONVERTERS)
try:
    do_something()
except Exception as e:
    log = open('logfile.euc_jp', 'a')
    log.write('%s\n' % exception_to_bytes(e, converters=c))
    log.close()
Each function in this list should take the exception as its sole argument and return a string containing the message representing the exception. The functions may return the message as a byte string (bytes), a str string, or even an object if you trust the object to return a decent string representation. The exception_to_unicode() and exception_to_bytes() functions will make sure to convert the string to the proper type before returning.
New in version 0.2.2.
- kitchen.text.converters.BYTE_EXCEPTION_CONVERTERS = (<function <lambda>>, <function to_bytes>)¶
Deprecated: Use EXCEPTION_CONVERTERS instead.
Tuple of functions to try to use to convert an exception into a string representation. This tuple is similar to the one in EXCEPTION_CONVERTERS but it’s used with exception_to_bytes() instead. Ideally, these functions should do their best to return the data as a byte string but the results will be run through to_bytes() before being returned.
New in version 0.2.2.
Changed in version 1.0.1: Deprecated as simplifications allow EXCEPTION_CONVERTERS to perform the same function.
- kitchen.text.converters.exception_to_unicode(exc, converters=(<function <lambda>>, <function <lambda>>))¶
Convert an exception object into a unicode representation
- Parameters
exc – Exception object to convert
converters – List of functions to use to convert the exception into a string. See EXCEPTION_CONVERTERS for the default value and an example of adding other converters to the defaults. The functions in the list are tried one at a time to see if they can extract a string from the exception. The first one to do so without raising an exception is used.
- Returns
str string representation of the exception. The value extracted by the converters will be converted into str before being returned using the utf-8 encoding. If you know you need to use an alternate encoding, add a function that does that to the list of functions in converters.
New in version 0.2.2.
- kitchen.text.converters.exception_to_bytes(exc, converters=(<function <lambda>>, <function <lambda>>))¶
Convert an exception object into a bytes representation
- Parameters
exc – Exception object to convert
converters – List of functions to use to convert the exception into a string. See EXCEPTION_CONVERTERS for the default value and an example of adding other converters to the defaults. The functions in the list are tried one at a time to see if they can extract a string from the exception. The first one to do so without raising an exception is used.
- Returns
byte string representation of the exception. The value extracted by the converters will be converted into bytes before being returned using the utf-8 encoding. If you know you need to use an alternate encoding, add a function that does that to the list of functions in converters.
New in version 0.2.2.
Changed in version 1.0.1: Code simplification allowed us to switch to using EXCEPTION_CONVERTERS as the default value of converters.