Version 0.16.1 (May 11, 2015)¶
This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.
Highlights include:
- Support for a CategoricalIndex, a category-based index, see here
- New section on how to contribute to pandas, see here
- Revised "Merge, join, and concatenate" documentation, including graphical examples to make it easier to understand each operation, see here
- New method sample for drawing random samples from Series, DataFrames and Panels, see here
- The default Index printing has changed to a more uniform format, see here
- BusinessHour datetime-offset is now supported, see here
- Further enhancement to the .str accessor to make string operations easier, see here
Warning
In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package (GH8961).
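As a rough migration sketch (not part of the original note), assuming the separately installable package is pandas-datareader; the ticker, data source, and dates below are illustrative only:

# hypothetical replacement for pandas.io.data (assumed package name)
# pip install pandas-datareader
from pandas_datareader import data

aapl = data.DataReader("AAPL", "yahoo", "2015-01-01", "2015-05-11")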
Enhancements¶
CategoricalIndex¶
We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.
In [1]: df = pd.DataFrame({'A': np.arange(6),
   ...:                    'B': pd.Series(list('aabbca'))
   ...:                           .astype('category', categories=list('cab'))
   ...:                    })
   ...:
In [2]: df
Out[2]:
A B
0 0 a
1 1 a
2 2 b
3 3 b
4 4 c
5 5 a
In [3]: df.dtypes
Out[3]:
A int64
B category
dtype: object
In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')
setting the index will create a CategoricalIndex
In [5]: df2 = df.set_index('B')
In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the categories or the operation will raise (see the sketch after the examples below).
In [7]: df2.loc['a']
Out[7]:
A
B
a 0
a 1
a 5
and preserves the CategoricalIndex
In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
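As a hedged illustration of the rule above (this output is not from the original docs), selecting a label that is not among the categories is expected to raise:

# 'e' is not among the categories ['c', 'a', 'b'], so label selection is expected to raise a KeyError
try:
    df2.loc['e']
except KeyError:
    print("'e' is not in the categories")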
sorting will order by the order of the categories
In [9]: df2.sort_index()
Out[9]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
groupby operations on the index will preserve the index nature as well
In [10]: df2.groupby(level=0).sum()
Out[10]:
A
B
c 4
a 6
b 5
In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
reindexing operations will return a resulting index based on the type of the passed indexer: passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.
In [12]: df2.reindex(['a', 'e'])
Out[12]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [13]: df2.reindex(['a', 'e']).index
Out[13]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
A
B
a 0.0
a 1.0
a 5.0
e NaN
In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: CategoricalIndex(['a', 'a', 'a', 'e'],
                          categories=['a', 'b', 'c', 'd', 'e'],
                          ordered=False, name='B',
                          dtype='category')
See the documentation for more. (GH7629, GH10038, GH10039)
Sample¶
Series, DataFrames, and Panels now have a new method: sample().
The method accepts a specific number of rows or columns to return, or a fraction of the
total number of rows or columns. It also has options for sampling with or without replacement,
for passing in a column of weights for non-uniform sampling, and for setting seed values to
facilitate replication (see the sketch at the end of this section). (GH2419)
In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])
# When no arguments are passed, returns 1 row
In [2]: example_series.sample()
Out[2]:
3 3
Length: 1, dtype: int64
# One may specify either a number of rows:
In [3]: example_series.sample(n=3)
Out[3]:
2 2
1 1
0 0
Length: 3, dtype: int64
# Or a fraction of the rows:
In [4]: example_series.sample(frac=0.5)
Out[4]:
1 1
5 5
3 3
Length: 3, dtype: int64
# weights are accepted.
In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]:
2 2
4 4
3 3
Length: 3, dtype: int64
# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]
In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]:
0 0
Length: 1, dtype: int64
When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.
In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})
In [10]: df.sample(n=3, weights="weight_column")
Out[10]:
col1 weight_column
0 9 0.5
1 8 0.4
2 7 0.1
[3 rows x 2 columns]
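A short reproducibility sketch (the seed value and variable names are illustrative, not from the original docs): passing the same random_state yields the same draw, and replace=True allows sampling with replacement.

# identical seeds produce identical samples
s1 = example_series.sample(n=3, random_state=42)
s2 = example_series.sample(n=3, random_state=42)
assert s1.equals(s2)

# with replacement, more rows than the Series holds can be drawn
example_series.sample(n=10, replace=True, random_state=42)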
String methods enhancements¶
Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard Python string operations.
- Added StringMethods (the .str accessor) to Index (GH9068). The .str accessor is now available for both Series and Index.

In [11]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [12]: idx.str.strip()
Out[12]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
- One special case for the .str accessor on Index is that if a string method returns bool, the .str accessor will return a np.array instead of a boolean Index (GH8875). This enables the following expression to work naturally:

In [13]: idx = pd.Index(["a1", "a2", "b1", "b2"])

In [14]: s = pd.Series(range(4), index=idx)

In [15]: s
Out[15]:
a1    0
a2    1
b1    2
b2    3
Length: 4, dtype: int64

In [16]: idx.str.startswith("a")
Out[16]: array([ True,  True, False, False])

In [17]: s[s.index.str.startswith("a")]
Out[17]:
a1    0
a2    1
Length: 2, dtype: int64
- The following new methods are accessible via the .str accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)

  capitalize(), swapcase(), normalize(), partition(), rpartition(), index(), rindex(), translate()

- split now takes an expand keyword to specify whether to expand dimensionality. return_type is deprecated. (GH9847)

In [18]: s = pd.Series(["a,b", "a,c", "b,c"])

# return Series
In [19]: s.str.split(",")
Out[19]:
0    [a, b]
1    [a, c]
2    [b, c]
Length: 3, dtype: object

# return DataFrame
In [20]: s.str.split(",", expand=True)
Out[20]:
   0  1
0  a  b
1  a  c
2  b  c

[3 rows x 2 columns]

In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])

# return Index
In [22]: idx.str.split(",")
Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')

# return MultiIndex
In [23]: idx.str.split(",", expand=True)
Out[23]:
MultiIndex([('a', 'b'),
            ('a', 'c'),
            ('b', 'c')],
           )
- Improved extract and get_dummies methods for Index.str (GH9980; see the sketch below)
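A brief, hedged sketch of a couple of the methods above; the inputs are illustrative only:

# partition splits each string at the first occurrence of the separator
s = pd.Series(["a_b_c", "c_d_e"])
s.str.partition("_")

# get_dummies is now available on Index.str as well
idx = pd.Index(["a|b", "a|c", "b|c"])
idx.str.get_dummies(sep="|")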
Other enhancements¶
- BusinessHour offset is now supported, which represents business hours starting from 09:00 - 17:00 on BusinessDay by default. See here for details. (GH7905)

In [24]: pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
Out[24]: Timestamp('2014-08-01 10:00:00')

In [25]: pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
Out[25]: Timestamp('2014-08-01 10:00:00')

In [26]: pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
Out[26]: Timestamp('2014-08-04 09:30:00')

- DataFrame.diff now takes an axis parameter that determines the direction of differencing (GH9727; see the sketch after this list)
- Allow clip, clip_lower, and clip_upper to accept array-like arguments as thresholds (this is a regression from 0.11.0). These methods now have an axis parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966; see the sketch after this list)
- DataFrame.mask() and Series.mask() now support the same keywords as where (GH8801)
- The drop function can now accept an errors keyword to suppress the ValueError raised when any of the labels does not exist in the target data. (GH6736)

In [27]: df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])

In [28]: df.drop(["A", "X"], axis=1, errors="ignore")
Out[28]:
          B         C
0 -0.706771 -1.039575
1 -0.424972  0.567020
2 -1.087401 -0.673690

[3 rows x 2 columns]

- Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)
- Allow conversion of values with dtype datetime64 or timedelta64 to strings using astype(str) (GH9757; see the sketch after this list)
- The get_dummies function now accepts a sparse keyword. If set to True, the returned DataFrame is sparse, e.g. a SparseDataFrame. (GH8823)
- Period now accepts datetime64 as value input. (GH9054)
- Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs 00:00:00. (GH9570)
- Allow Panel.shift with axis='items' (GH9890)
- Trying to write an Excel file now raises NotImplementedError if the DataFrame has a MultiIndex, instead of writing a broken Excel file. (GH9794)
- Allow Categorical.add_categories to accept Series or np.array. (GH9927)
- Add/delete str/dt/cat accessors dynamically from __dir__. (GH9910)
- Add normalize as a dt accessor method. (GH10047)
- DataFrame and Series now have a _constructor_expanddim property as an overridable constructor for data of one higher dimensionality. This should be used only when it is really needed, see here
- pd.lib.infer_dtype now returns 'bytes' in Python 3 where appropriate. (GH10032)
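A hedged sketch of a few of the enhancements above; the frame, thresholds, and dates are illustrative only, and the precise alignment semantics of clip are described in the docs:

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 6, 9]})

# DataFrame.diff along the columns via the new axis parameter
df.diff(axis=1)

# clip with an array-like of per-row lower bounds, aligned on the index via axis=0
lower = pd.Series([1, 3, 5])
df.clip(lower=lower, axis=0)

# datetime64 values rendered as strings via astype(str)
pd.Series(pd.date_range("2015-05-11", periods=2)).astype(str)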
API changes¶
- When passing an ax to df.plot(..., ax=ax), the sharex kwarg will now default to False. The result is that the visibility of xlabels and xticklabels will no longer be changed. You have to do that yourself for the correct axes in your figure, or set sharex=True explicitly (but this changes the visibility for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no passed-in ax kwarg), then the default is still sharex=True and the visibility changes are applied.
- assign() now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777; see the sketch after this list)
- By default, read_csv and read_table will now try to infer the compression type based on the file extension. Set compression=None to restore the previous behavior (no decompression). (GH9770; see the sketch after this list)
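A hedged sketch of the assign ordering and compression-inference changes; "data.csv.gz" is a hypothetical file name:

# new columns added via assign are appended in alphabetical order
pd.DataFrame({"x": [1, 2]}).assign(b=1, a=2).columns  # expected: Index(['x', 'a', 'b'], dtype='object')

# compression is inferred from the file extension...
df = pd.read_csv("data.csv.gz")

# ...and compression=None restores the previous behavior (no decompression)
raw = pd.read_csv("data.csv.gz", compression=None)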
Deprecations¶
- Series.str.split's return_type keyword was removed in favor of expand (GH9847)
Index representation¶
The string representation of Index and its sub-classes has now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for many values (but fewer than display.max_seq_items); and a truncated display (the head and tail of the data) if there are very many items (more than display.max_seq_items). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH6482)
Previous behavior
In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')
In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')
In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern
In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern
New behavior
In [29]: pd.set_option("display.width", 80)
In [30]: pd.Index(range(4), name="foo")
Out[30]: RangeIndex(start=0, stop=4, step=1, name='foo')
In [31]: pd.Index(range(30), name="foo")
Out[31]: RangeIndex(start=0, stop=30, step=1, name='foo')
In [32]: pd.Index(range(104), name="foo")
Out[32]: RangeIndex(start=0, stop=104, step=1, name='foo')
In [33]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
Out[33]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')
In [34]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
Out[34]:
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
'a', 'bb', 'ccc', 'dddd'],
categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')
In [35]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
Out[35]:
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
'bb',
...
'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
'dddd'],
categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar', length=400)
In [36]: pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
Out[36]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')
In [37]: pd.date_range("20130101", periods=25, freq="D")
Out[37]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
'2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
'2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
'2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
'2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
'2013-01-25'],
dtype='datetime64[ns]', freq='D')
In [38]: pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
Out[38]:
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
'2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
'2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
'2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
'2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
...
'2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
'2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
'2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
'2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
'2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')
Performance improvements¶
Bug fixes¶
- Bug where labels did not appear properly in the legend of DataFrame.plot(), passing label= arguments works, and Series indices are no longer mutated. (GH9542)
- Bug in json serialization causing a segfault when a frame had zero length. (GH9805)
- Bug in read_csv where missing trailing delimiters would cause a segfault. (GH5664)
- Bug in retaining index name on appending (GH9862)
- Bug in scatter_matrix drawing unexpected axis ticklabels (GH5662)
- Fixed bug in StataWriter resulting in changes to the input DataFrame upon save (GH9795).
- Bug in transform causing length mismatch when null entries were present and a fast aggregator was being used (GH9697)
- Bug in equals causing false negatives when block order differed (GH9330)
- Bug in grouping with multiple pd.Grouper where one is non-time based (GH10063)
- Bug in read_sql_table error when reading a postgres table with timezone (GH7139)
- Bug where DataFrame slicing may not retain metadata (GH9776)
- Bug where TimedeltaIndex was not properly serialized in a fixed HDFStore (GH9635)
- Bug with the TimedeltaIndex constructor ignoring name when given another TimedeltaIndex as data (GH10025).
- Bug in DataFrameFormatter._get_formatted_index not applying max_colwidth to the DataFrame index (GH7856)
- Bug in .loc with a read-only ndarray data source (GH10043)
- Bug in groupby.apply() that would raise if a passed user-defined function returned only None (for all input). (GH9685)
- Always use temporary files in pytables tests (GH9992)
- Bug where plotting continuously using secondary_y may not show the legend properly. (GH9610, GH9779)
- Bug where DataFrame.plot(kind="hist") results in TypeError when the DataFrame contains non-numeric columns (GH9853)
- Bug where repeated plotting of a DataFrame with a DatetimeIndex may raise TypeError (GH9852)
- Bug in setup.py that would allow an incompatible Cython version to build (GH9827)
- Bug in plotting where secondary_y incorrectly attaches a right_ax property to secondary axes specifying itself recursively. (GH9861)
- Bug in Series.quantile on an empty Series of type Datetime or Timedelta (GH9675)
- Bug in where causing incorrect results when upcasting was required (GH9731)
- Bug in FloatArrayFormatter where the decision boundary for displaying "small" floats in decimal format is off by one order of magnitude for a given display.precision (GH9764)
- Fixed bug where DataFrame.plot() raised an error when both color and style keywords were passed and there was no color symbol in the style strings (GH9671)
- Not showing a DeprecationWarning on combining list-likes with an Index (GH10083)
- Bug in read_csv and read_table when using the skip_rows parameter if blank lines are present. (GH9832)
- Bug where read_csv() interprets index_col=True as 1 (GH9798)
- Bug in index equality comparisons using == failing on Index/MultiIndex type incompatibility (GH9785)
- Bug in which SparseDataFrame could not take nan as a column name (GH8822)
- Bug in to_msgpack and read_msgpack zlib and blosc compression support (GH9783)
- Bug where GroupBy.size doesn't attach the index name properly if grouped by TimeGrouper (GH9925)
- Bug causing an exception in slice assignments because length_of_indexer returns wrong results (GH9995)
- Bug in the csv parser causing lines with initial white space plus one non-space character to be skipped. (GH9710)
- Bug in the C csv parser causing spurious NaNs when data started with a newline followed by white space. (GH10022)
- Bug causing elements with a null group to spill into the final group when grouping by a Categorical (GH9603)
- Bug where .iloc and .loc behavior is not consistent on empty dataframes (GH9964)
- Bug where invalid attribute access on a TimedeltaIndex incorrectly raised ValueError instead of AttributeError (GH9680)
- Bug in unequal comparisons between categorical data and a scalar which was not in the categories (e.g. Series(Categorical(list("abc"), ordered=True)) > "d"). This returned False for all elements, but now raises a TypeError. Equality comparisons also now return False for == and True for !=. (GH9848)
- Bug in DataFrame __setitem__ when the right-hand side is a dictionary (GH9874)
- Bug in where when the dtype is datetime64/timedelta64, but the dtype of other is not (GH9804)
- Bug where MultiIndex.sortlevel() breaks with unicode level names (GH9856)
- Bug in which groupby.transform incorrectly enforced output dtypes to match input dtypes. (GH9807)
- Bug in the DataFrame constructor when the columns parameter is set and data is an empty list (GH9939)
- Bug in bar plot with log=True raising TypeError if all values are less than 1 (GH9905)
- Bug in horizontal bar plot ignoring log=True (GH9905)
- Bug in PyTables queries that did not return proper results using the index (GH8265, GH9676)
- Bug where dividing a DataFrame containing values of type Decimal by another Decimal would raise. (GH9787)
- Bug where using DataFrame.asfreq would remove the name of the index. (GH9885)
- Bug causing an extra index point when resampling BM/BQ (GH9756)
- Changed caching in AbstractHolidayCalendar to be at the instance level rather than at the class level, as the latter can result in unexpected behaviour. (GH9552)
- Fixed latex output for MultiIndexed dataframes (GH9778)
- Bug causing an exception when setting an empty range using DataFrame.loc (GH9596)
- Bug in hiding ticklabels with subplots and shared axes when adding a new plot to an existing grid of axes (GH9158)
- Bug in transform and filter when grouping on a categorical variable (GH9921)
- Bug in transform when groups are equal in number and dtype to the input index (GH9700)
- Google BigQuery connector now imports dependencies on a per-method basis. (GH9713)
- Updated BigQuery connector to no longer use deprecated oauth2client.tools.run() (GH8327)
- Bug in subclassed DataFrame: it may not return the correct class when slicing or subsetting it. (GH9632)
- Bug in .median() where non-float null values are not handled correctly (GH10040)
- Bug in Series.fillna() where it raises if a numerically convertible string is given (GH10092)