Python 2 versus Python 3
The biggest difference between Python 2 and Python 3 is in their
string handling, and this is particularly relevant to Patsy since
it parses user input. We follow a simple rule: input to Patsy
should always be of type str. That means that on Python 2, you
should pass byte-strings (not unicode), and on Python 3, you should
pass unicode strings (not byte-strings). Similarly, when Patsy
passes text back (e.g. DesignInfo.column_names), it’s always
in the form of a str.
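As a rough illustration of the rule on Python 3: if formula text arrives as a byte-string (say, read from a file opened in binary mode), it can be decoded to str before being handed to Patsy. The helper name as_formula_text and the UTF-8 default are assumptions made for this sketch; they are not part of Patsy.

```python
# Python 3 only: on Python 2, bytes and str are the same type.
def as_formula_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is a byte-string.

    Hypothetical helper for illustration; not a Patsy function.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# e.g. formula text that was read from a binary file or a socket
formula = as_formula_text(b"y ~ x1 + x2")
print(type(formula).__name__, formula)
```

The resulting formula is an ordinary str, which is the type Patsy expects for its input on Python 3.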
In addition to this being the most convenient for users (you never
need to use any b"weird" u"prefixes" when writing a formula string),
it’s actually a necessary consequence of a deeper change in the Python
language: in Python 2, Python code itself is represented as
byte-strings, and that’s the only form of input accepted by the
tokenize module. On the other hand, Python 3’s tokenizer and
parser use unicode, and since Patsy processes Python code, it has
to follow suit.
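To make the Python 3 side of this concrete, the standard library's tokenize.generate_tokens() reads source text as str. The sketch below tokenizes a small expression of the kind that might appear in a formula term; the function name name_tokens is made up for this example, and the code is not taken from Patsy's own implementation.

```python
import io
import tokenize


def name_tokens(code):
    """Return the identifier (NAME) tokens found in a str of Python source."""
    # generate_tokens() wants a readline callable that yields str lines.
    readline = io.StringIO(code).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.NAME
    ]


names = name_tokens("np.log(x1) + x2")
print(names)  # ['np', 'log', 'x1', 'x2']
```

Because the tokenizer consumes str here, anything that feeds Python code to it, as Patsy does, naturally works with str as well.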
There is one exception to this rule: on Python 2, as a convenience for
those using from __future__ import unicode_literals, the
high-level API functions dmatrix(), dmatrices(),
incr_dbuilders(), and incr_dbuilder() do accept
unicode strings – BUT these unicode string objects are still
required to contain only ASCII characters; if they contain any
non-ASCII characters then an error will be raised. If you really need
non-ASCII in your formulas, then you should consider upgrading to
Python 3. Low-level APIs like ModelDesc.from_formula() continue
to insist on str objects only.
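For code that still has to run on Python 2, one option is to check up front that a formula is ASCII-only, so the problem surfaces with a clear message before the formula reaches Patsy. This is only a sketch: require_ascii is a hypothetical helper, and the exception Patsy itself raises in this situation is not specified here.

```python
def require_ascii(formula):
    """Return formula unchanged if it is ASCII-only, else raise ValueError.

    Hypothetical pre-check for illustration; not a Patsy function.
    """
    try:
        # Encoding to ASCII fails on the first non-ASCII character.
        formula.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("formula contains non-ASCII characters: %r" % (formula,))
    return formula


ok = require_ascii(u"y ~ x1 + x2")

try:
    require_ascii(u"y ~ caf\u00e9")
    rejected = False
except ValueError:
    rejected = True
```

Here ok is the original formula and rejected ends up True, since the second formula contains a non-ASCII character.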