Coding categorical data¶
Patsy allows great flexibility in how categorical data is coded,
via the function C()
. C()
marks some data as being
categorical (including data which would not automatically be treated
as categorical, such as a column of integers), while also optionally
setting the preferred coding scheme and level ordering.
Let’s get some categorical data to work with:
In [1]: from patsy import dmatrix, demo_data, ContrastMatrix, Poly
In [2]: data = demo_data("a", nlevels=3)
In [3]: data
Out[3]: {'a': ['a1', 'a2', 'a3', 'a1', 'a2', 'a3']}
As you know, simply giving Patsy a categorical variable causes it
to be coded using the default Treatment
coding
scheme. (Strings and booleans are treated as categorical by default.)
In [4]: dmatrix("a", data)
Out[4]:
DesignMatrix with shape (6, 3)
Intercept a[T.a2] a[T.a3]
1 0 0
1 1 0
1 0 1
1 0 0
1 1 0
1 0 1
Terms:
'Intercept' (column 0)
'a' (columns 1:3)
We can also alter the level ordering, which is useful for, e.g.,
Diff
coding:
In [5]: l = ["a3", "a2", "a1"]
In [6]: dmatrix("C(a, levels=l)", data)
Out[6]:
DesignMatrix with shape (6, 3)
Intercept C(a, levels=l)[T.a2] C(a, levels=l)[T.a1]
1 0 1
1 1 0
1 0 0
1 0 1
1 1 0
1 0 0
Terms:
'Intercept' (column 0)
'C(a, levels=l)' (columns 1:3)
But the default coding is just that – a default. The easiest alternative is to use one of the other built-in coding schemes, like orthogonal polynomial coding:
In [7]: dmatrix("C(a, Poly)", data)
Out[7]:
DesignMatrix with shape (6, 3)
Intercept C(a, Poly).Linear C(a, Poly).Quadratic
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 0.70711 0.40825
1 -0.70711 0.40825
1 -0.00000 -0.81650
1 0.70711 0.40825
Terms:
'Intercept' (column 0)
'C(a, Poly)' (columns 1:3)
There are a number of built-in coding schemes; for details you can check the API reference. But we aren’t restricted to those. We can also provide a custom contrast matrix, which allows us to produce all kinds of strange designs:
In [8]: contrast = [[1, 2], [3, 4], [5, 6]]
In [9]: dmatrix("C(a, contrast)", data)
Out[9]:
DesignMatrix with shape (6, 3)
Intercept C(a, contrast)[custom0] C(a, contrast)[custom1]
1 1 2
1 3 4
1 5 6
1 1 2
1 3 4
1 5 6
Terms:
'Intercept' (column 0)
'C(a, contrast)' (columns 1:3)
In [10]: dmatrix("C(a, [[1], [2], [-4]])", data)
Out[10]:
DesignMatrix with shape (6, 2)
Intercept C(a, [[1], [2], [-4]])[custom0]
1 1
1 2
1 -4
1 1
1 2
1 -4
Terms:
'Intercept' (column 0)
'C(a, [[1], [2], [-4]])' (column 1)
Hmm, those [custom0]
, [custom1]
names that Patsy
auto-generated for us are a bit ugly looking. We can attach names to
our contrast matrix by creating a ContrastMatrix
object, and
make things prettier:
In [11]: contrast_mat = ContrastMatrix(contrast, ["[pretty0]", "[pretty1]"])
In [12]: dmatrix("C(a, contrast_mat)", data)
Out[12]:
DesignMatrix with shape (6, 3)
Intercept C(a, contrast_mat)[pretty0] C(a, contrast_mat)[pretty1]
1 1 2
1 3 4
1 5 6
1 1 2
1 3 4
1 5 6
Terms:
'Intercept' (column 0)
'C(a, contrast_mat)' (columns 1:3)
And, finally, if we want to get really fancy, we can also define our
own “smart” coding schemes like Poly
. Just define a class
that has two methods, code_with_intercept()
and
code_without_intercept()
. They have identical signatures, taking
a list of levels as their argument and returning a
ContrastMatrix
. Patsy will automatically choose the
appropriate method to call to produce a full-rank design matrix
without redundancy; see Redundancy and categorical factors for the full details on how
Patsy makes this decision.
As an example, here’s a simplified version of the built-in
Treatment
coding object:
import numpy as np
class MyTreat(object):
def __init__(self, reference=0):
self.reference = reference
def code_with_intercept(self, levels):
return ContrastMatrix(np.eye(len(levels)),
["[My.%s]" % (level,) for level in levels])
def code_without_intercept(self, levels):
eye = np.eye(len(levels) - 1)
contrasts = np.vstack((eye[:self.reference, :],
np.zeros((1, len(levels) - 1)),
eye[self.reference:, :]))
suffixes = ["[MyT.%s]" % (level,) for level in
levels[:self.reference] + levels[self.reference + 1:]]
return ContrastMatrix(contrasts, suffixes)
And it can now be used just like the built-in methods:
# Full rank:
In [13]: dmatrix("0 + C(a, MyTreat)", data)
Out[13]:
DesignMatrix with shape (6, 3)
C(a, MyTreat)[My.a1] C(a, MyTreat)[My.a2] C(a, MyTreat)[My.a3]
1 0 0
0 1 0
0 0 1
1 0 0
0 1 0
0 0 1
Terms:
'C(a, MyTreat)' (columns 0:3)
# Reduced rank:
In [14]: dmatrix("C(a, MyTreat)", data)
Out[14]:
DesignMatrix with shape (6, 3)
Intercept C(a, MyTreat)[MyT.a2] C(a, MyTreat)[MyT.a3]
1 0 0
1 1 0
1 0 1
1 0 0
1 1 0
1 0 1
Terms:
'Intercept' (column 0)
'C(a, MyTreat)' (columns 1:3)
# With argument:
In [15]: dmatrix("C(a, MyTreat(2))", data)
Out[15]:
DesignMatrix with shape (6, 3)
Intercept C(a, MyTreat(2))[MyT.a1] C(a, MyTreat(2))[MyT.a2]
1 1 0
1 0 1
1 0 0
1 1 0
1 0 1
1 0 0
Terms:
'Intercept' (column 0)
'C(a, MyTreat(2))' (columns 1:3)