Quickstart¶
If you prefer to learn by diving in and getting your feet wet, then here are some cut-and-pasteable examples to play with.
First, let’s import stuff and get some data to work with:
In [1]: import numpy as np
In [2]: from patsy import dmatrices, dmatrix, demo_data
In [3]: data = demo_data("a", "b", "x1", "x2", "y", "z column")
demo_data()
gives us a mix of categorical and numerical
variables:
In [4]: data
Out[4]:
{'a': ['a1', 'a1', 'a2', 'a2', 'a1', 'a1', 'a2', 'a2'],
'b': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
'x1': array([ 1.76405235, 0.40015721, 0.97873798, 2.2408932 , 1.86755799,
-0.97727788, 0.95008842, -0.15135721]),
'x2': array([-0.10321885, 0.4105985 , 0.14404357, 1.45427351, 0.76103773,
0.12167502, 0.44386323, 0.33367433]),
'y': array([ 1.49407907, -0.20515826, 0.3130677 , -0.85409574, -2.55298982,
0.6536186 , 0.8644362 , -0.74216502]),
'z column': array([ 2.26975462, -1.45436567, 0.04575852, -0.18718385, 1.53277921,
1.46935877, 0.15494743, 0.37816252])}
Of course Patsy doesn’t much care what sort of object you store
your data in, so long as it can be indexed like a Python dictionary,
data[varname]
. You may prefer to store your data in a pandas DataFrame, or a numpy
record array… whatever makes you happy.
Now, let’s generate design matrices suitable for regressing y
onto
x1
and x2
.
In [5]: dmatrices("y ~ x1 + x2", data)
Out[5]:
(DesignMatrix with shape (8, 1)
y
1.49408
-0.20516
0.31307
-0.85410
-2.55299
0.65362
0.86444
-0.74217
Terms:
'y' (column 0),
DesignMatrix with shape (8, 3)
Intercept x1 x2
1 1.76405 -0.10322
1 0.40016 0.41060
1 0.97874 0.14404
1 2.24089 1.45427
1 1.86756 0.76104
1 -0.97728 0.12168
1 0.95009 0.44386
1 -0.15136 0.33367
Terms:
'Intercept' (column 0)
'x1' (column 1)
'x2' (column 2))
The return value is a Python tuple containing two DesignMatrix
objects, the first representing the left-hand side of our formula, and
the second representing the right-hand side. Notice that an intercept
term was automatically added to the right-hand side. These are just
ordinary numpy arrays with some extra metadata and a fancy __repr__
method attached, so we can pass them directly to a regression function
like np.linalg.lstsq()
:
In [6]: outcome, predictors = dmatrices("y ~ x1 + x2", data)
In [7]: betas = np.linalg.lstsq(predictors, outcome)[0].ravel()
In [8]: for name, beta in zip(predictors.design_info.column_names, betas):
...: print("%s: %s" % (name, beta))
...:
Intercept: 0.5796623441231167
x1: 0.0885991903553558
x2: -1.7647920555065
Of course the resulting numbers aren’t very interesting, since this is just random data.
If you just want the design matrix alone, without the y
values,
use dmatrix()
and leave off the y ~
part at the beginning:
In [9]: dmatrix("x1 + x2", data)
Out[9]:
DesignMatrix with shape (8, 3)
Intercept x1 x2
1 1.76405 -0.10322
1 0.40016 0.41060
1 0.97874 0.14404
1 2.24089 1.45427
1 1.86756 0.76104
1 -0.97728 0.12168
1 0.95009 0.44386
1 -0.15136 0.33367
Terms:
'Intercept' (column 0)
'x1' (column 1)
'x2' (column 2)
We’ll use dmatrix for the rest of the examples, since seeing the
outcome matrix over and over would get boring. This matrix’s metadata
is stored in an extra attribute called .design_info
, which is a
DesignInfo
object you can explore at your leisure:
In [10]: d = dmatrix("x1 + x2", data)
In [11]: d.design_info.<TAB>
d.design_info.builder d.design_info.slice
d.design_info.column_name_indexes d.design_info.term_name_slices
d.design_info.column_names d.design_info.term_names
d.design_info.describe d.design_info.term_slices
d.design_info.linear_constraint d.design_info.terms
Usually the intercept is useful, but if we don’t want it we can get rid of it:
In [12]: dmatrix("x1 + x2 - 1", data)
Out[12]:
DesignMatrix with shape (8, 2)
x1 x2
1.76405 -0.10322
0.40016 0.41060
0.97874 0.14404
2.24089 1.45427
1.86756 0.76104
-0.97728 0.12168
0.95009 0.44386
-0.15136 0.33367
Terms:
'x1' (column 0)
'x2' (column 1)
We can transform variables using arbitrary Python code:
In [13]: dmatrix("x1 + np.log(x2 + 10)", data)
Out[13]:
DesignMatrix with shape (8, 3)
Intercept x1 np.log(x2 + 10)
1 1.76405 2.29221
1 0.40016 2.34282
1 0.97874 2.31689
1 2.24089 2.43836
1 1.86756 2.37593
1 -0.97728 2.31468
1 0.95009 2.34601
1 -0.15136 2.33541
Terms:
'Intercept' (column 0)
'x1' (column 1)
'np.log(x2 + 10)' (column 2)
Notice that np.log
is being pulled out of the environment where
dmatrix()
was called – np.log
is accessible because we did
import numpy as np
up above. Any functions or variables that you
could reference when calling dmatrix()
can also be used inside
the formula passed to dmatrix()
. For example:
In [14]: new_x2 = data["x2"] * 100
In [15]: dmatrix("new_x2")
Out[15]:
DesignMatrix with shape (8, 2)
Intercept new_x2
1 -10.32189
1 41.05985
1 14.40436
1 145.42735
1 76.10377
1 12.16750
1 44.38632
1 33.36743
Terms:
'Intercept' (column 0)
'new_x2' (column 1)
Patsy has some transformation functions “built in”, that are automatically accessible to your code:
In [16]: dmatrix("center(x1) + standardize(x2)", data)
Out[16]:
DesignMatrix with shape (8, 3)
Intercept center(x1) standardize(x2)
1 0.87995 -1.21701
1 -0.48395 -0.07791
1 0.09463 -0.66885
1 1.35679 2.23584
1 0.98345 0.69899
1 -1.86138 -0.71844
1 0.06598 -0.00417
1 -1.03546 -0.24845
Terms:
'Intercept' (column 0)
'center(x1)' (column 1)
'standardize(x2)' (column 2)
See patsy.builtins
for a complete list of functions made
available to formulas. You can also define your own transformation
functions in the ordinary Python way:
In [17]: def double(x):
....: return 2 * x
....:
In [18]: dmatrix("x1 + double(x1)", data)
Out[18]:
DesignMatrix with shape (8, 3)
Intercept x1 double(x1)
1 1.76405 3.52810
1 0.40016 0.80031
1 0.97874 1.95748
1 2.24089 4.48179
1 1.86756 3.73512
1 -0.97728 -1.95456
1 0.95009 1.90018
1 -0.15136 -0.30271
Terms:
'Intercept' (column 0)
'x1' (column 1)
'double(x1)' (column 2)
This flexibility does create problems in one case, though – because
we interpret whatever you write in-between the +
signs as Python
code, you do in fact have to write valid Python code. And this can be
tricky if your variable names have funny characters in them, like
whitespace or punctuation. Fortunately, patsy has a builtin
“transformation” called Q()
that lets you “quote” such
variables:
In [19]: weird_data = demo_data("weird column!", "x1")
# This is an error...
In [20]: dmatrix("weird column! + x1", weird_data)
[...]
PatsyError: error tokenizing input (maybe an unclosed string?)
weird column! + x1
^
# ...but this works:
In [21]: dmatrix("Q('weird column!') + x1", weird_data)
Out[21]:
DesignMatrix with shape (5, 3)
Intercept Q('weird column!') x1
1 1.76405 -0.97728
1 0.40016 0.95009
1 0.97874 -0.15136
1 2.24089 -0.10322
1 1.86756 0.41060
Terms:
'Intercept' (column 0)
"Q('weird column!')" (column 1)
'x1' (column 2)
Q()
even plays well with other transformations:
In [22]: dmatrix("double(Q('weird column!')) + x1", weird_data)
Out[22]:
DesignMatrix with shape (5, 3)
Intercept double(Q('weird column!')) x1
1 3.52810 -0.97728
1 0.80031 0.95009
1 1.95748 -0.15136
1 4.48179 -0.10322
1 3.73512 0.41060
Terms:
'Intercept' (column 0)
"double(Q('weird column!'))" (column 1)
'x1' (column 2)
Arithmetic transformations are also possible, but you’ll need to
“protect” them by wrapping them in I()
, so that Patsy knows
that you really do want +
to mean addition:
In [23]: dmatrix("I(x1 + x2)", data) # compare to "x1 + x2"
Out[23]:
DesignMatrix with shape (8, 2)
Intercept I(x1 + x2)
1 1.66083
1 0.81076
1 1.12278
1 3.69517
1 2.62860
1 -0.85560
1 1.39395
1 0.18232
Terms:
'Intercept' (column 0)
'I(x1 + x2)' (column 1)
Note that while Patsy goes to considerable efforts to take in data represented using different Python data types and convert them into a standard representation, all this work happens after any transformations you perform as part of your formula. So, for example, if your data is in the form of numpy arrays, “+” will perform element-wise addition, but if it is in standard Python lists, it will perform concatenation:
In [24]: dmatrix("I(x1 + x2)", {"x1": np.array([1, 2, 3]), "x2": np.array([4, 5, 6])})
Out[24]:
DesignMatrix with shape (3, 2)
Intercept I(x1 + x2)
1 5
1 7
1 9
Terms:
'Intercept' (column 0)
'I(x1 + x2)' (column 1)
In [25]: dmatrix("I(x1 + x2)", {"x1": [1, 2, 3], "x2": [4, 5, 6]})
Out[25]:
DesignMatrix with shape (6, 2)
Intercept I(x1 + x2)
1 1
1 2
1 3
1 4
1 5
1 6
Terms:
'Intercept' (column 0)
'I(x1 + x2)' (column 1)
Patsy becomes particularly useful when you have categorical data. If you use a predictor that has a categorical type (e.g. strings or bools), it will be automatically coded. Patsy automatically chooses an appropriate way to code categorical data to avoid producing a redundant, overdetermined model.
If there is just one categorical variable alone, the default is to dummy code it:
In [26]: dmatrix("0 + a", data)
Out[26]:
DesignMatrix with shape (8, 2)
a[a1] a[a2]
1 0
1 0
0 1
0 1
1 0
1 0
0 1
0 1
Terms:
'a' (columns 0:2)
But if you did that and put the intercept back in, you’d get a redundant model. So if the intercept is present, Patsy uses a reduced-rank contrast code (treatment coding by default):
In [27]: dmatrix("a", data)
Out[27]:
DesignMatrix with shape (8, 2)
Intercept a[T.a2]
1 0
1 0
1 1
1 1
1 0
1 0
1 1
1 1
Terms:
'Intercept' (column 0)
'a' (column 1)
The T.
notation is there to remind you that these columns are
treatment coded.
Interactions are also easy – they represent the cartesian product of
all the factors involved. Here’s a dummy coding of each combination
of values taken by a
and b
:
In [28]: dmatrix("0 + a:b", data)
Out[28]:
DesignMatrix with shape (8, 4)
a[a1]:b[b1] a[a2]:b[b1] a[a1]:b[b2] a[a2]:b[b2]
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 1
1 0 0 0
0 0 1 0
0 1 0 0
0 0 0 1
Terms:
'a:b' (columns 0:4)
But interactions also know how to use contrast coding to avoid redundancy. If you have both main effects and interactions in a model, then Patsy goes from lower-order effects to higher-order effects, adding in just enough columns to produce a well-defined model. The result is that each set of columns measures the additional contribution of this effect – just what you want for a traditional ANOVA:
In [29]: dmatrix("a + b + a:b", data)
Out[29]:
DesignMatrix with shape (8, 4)
Intercept a[T.a2] b[T.b2] a[T.a2]:b[T.b2]
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
Terms:
'Intercept' (column 0)
'a' (column 1)
'b' (column 2)
'a:b' (column 3)
Since this is so common, there’s a convenient short-hand:
In [30]: dmatrix("a*b", data)
Out[30]:
DesignMatrix with shape (8, 4)
Intercept a[T.a2] b[T.b2] a[T.a2]:b[T.b2]
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
1 0 0 0
1 0 1 0
1 1 0 0
1 1 1 1
Terms:
'Intercept' (column 0)
'a' (column 1)
'b' (column 2)
'a:b' (column 3)
Of course you can use other coding schemes too (or even define your own). Here’s orthogonal polynomial coding
:
In [31]: dmatrix("C(c, Poly)", {"c": ["c1", "c1", "c2", "c2", "c3", "c3"]})
Out[31]:
DesignMatrix with shape (6, 3)
Intercept C(c, Poly).Linear C(c, Poly).Quadratic
1 -0.70711 0.40825
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 -0.00000 -0.81650
1 0.70711 0.40825
1 0.70711 0.40825
Terms:
'Intercept' (column 0)
'C(c, Poly)' (columns 1:3)
You can even write interactions between categorical and numerical
variables. Here we fit two different slope coefficients for x1
;
one for the a1
group, and one for the a2
group:
In [32]: dmatrix("a:x1", data)
Out[32]:
DesignMatrix with shape (8, 3)
Intercept a[a1]:x1 a[a2]:x1
1 1.76405 0.00000
1 0.40016 0.00000
1 0.00000 0.97874
1 0.00000 2.24089
1 1.86756 0.00000
1 -0.97728 -0.00000
1 0.00000 0.95009
1 -0.00000 -0.15136
Terms:
'Intercept' (column 0)
'a:x1' (columns 1:3)
The same redundancy avoidance code works here, so if you’d rather have
treatment-coded slopes (one slope for the a1
group, and a second
for the difference between the a1
and a2
group slopes), then
you can request it like this:
# compare to the difference between "0 + a" and "1 + a"
In [33]: dmatrix("x1 + a:x1", data)
Out[33]:
DesignMatrix with shape (8, 3)
Intercept x1 a[T.a2]:x1
1 1.76405 0.00000
1 0.40016 0.00000
1 0.97874 0.97874
1 2.24089 2.24089
1 1.86756 0.00000
1 -0.97728 -0.00000
1 0.95009 0.95009
1 -0.15136 -0.15136
Terms:
'Intercept' (column 0)
'x1' (column 1)
'a:x1' (column 2)
And more complex expressions work too:
In [34]: dmatrix("C(a, Poly):center(x1)", data)
Out[34]:
DesignMatrix with shape (8, 3)
Intercept C(a, Poly).Constant:center(x1) C(a, Poly).Linear:center(x1)
1 0.87995 -0.62222
1 -0.48395 0.34220
1 0.09463 0.06691
1 1.35679 0.95939
1 0.98345 -0.69541
1 -1.86138 1.31620
1 0.06598 0.04666
1 -1.03546 -0.73218
Terms:
'Intercept' (column 0)
'C(a, Poly):center(x1)' (columns 1:3)