Version 0.16.1 (May 11, 2015)

This is a minor bug-fix release from 0.16.0 and includes a large number of bug fixes along with several new features, enhancements, and performance improvements. We recommend that all users upgrade to this version.

Highlights include:

  • Support for a CategoricalIndex, a category based index, see here

  • New section on how-to-contribute to pandas, see here

  • Revised “Merge, join, and concatenate” documentation, including graphical examples to make it easier to understand each operation, see here

  • New method sample for drawing random samples from Series, DataFrames and Panels. See here

  • The default Index printing has changed to a more uniform format, see here

  • BusinessHour datetime-offset is now supported, see here

  • Further enhancement to the .str accessor to make string operations easier, see here

Warning

In pandas 0.17.0, the sub-package pandas.io.data will be removed in favor of a separately installable package (GH8961).

Enhancements

CategoricalIndex

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to a regular object-based Index.

In [1]: df = pd.DataFrame({'A': np.arange(6),
   ...:                    'B': pd.Series(list('aabbca'))
   ...:                           .astype('category', categories=list('cab'))
   ...:                    })
   ...:

In [2]: df
Out[2]:
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [3]: df.dtypes
Out[3]:
A       int64
B    category
dtype: object

In [4]: df.B.cat.categories
Out[4]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex:

In [5]: df2 = df.set_index('B')

In [6]: df2.index
Out[6]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the categories or the operation will raise.

In [7]: df2.loc['a']
Out[7]:
   A
B
a  0
a  1
a  5

The selection preserves the CategoricalIndex:

In [8]: df2.loc['a'].index
Out[8]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting will order by the order of the categories:

In [9]: df2.sort_index()
Out[9]:
   A
B
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well:

In [10]: df2.groupby(level=0).sum()
Out[10]:
   A
B
c  4
a  6
b  5

In [11]: df2.groupby(level=0).sum().index
Out[11]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer: passing a list will return a plain-old Index, while indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarily index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [12]: df2.reindex(['a', 'e'])
Out[12]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [13]: df2.reindex(['a', 'e']).index
Out[13]: pd.Index(['a', 'a', 'a', 'e'], dtype='object', name='B')

In [14]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
Out[14]:
     A
B
a  0.0
a  1.0
a  5.0
e  NaN

In [15]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
Out[15]: pd.CategoricalIndex(['a', 'a', 'a', 'e'],
                             categories=['a', 'b', 'c', 'd', 'e'],
                             ordered=False, name='B',
                             dtype='category')

See the documentation for more. (GH7629, GH10038, GH10039)
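
The `astype('category', categories=...)` call in the transcript above reflects the API at the time of this release and may not be accepted by later pandas versions. The following is a minimal sketch, assuming a more recent pandas in which the category order is supplied through `pd.Categorical`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Build the categorical column with an explicit category order (c, a, b).
df = pd.DataFrame({
    "A": np.arange(6),
    "B": pd.Categorical(list("aabbca"), categories=list("cab")),
})

# Setting the index on a category column yields a CategoricalIndex.
df2 = df.set_index("B")
is_categorical_index = isinstance(df2.index, pd.CategoricalIndex)

# Label-based selection returns every row carrying that label.
rows_a = df2.loc["a"]["A"].tolist()

# Sorting follows the category order, not alphabetical order.
sorted_labels = df2.sort_index().index.tolist()

# Grouping on the index keeps a CategoricalIndex on the result.
# observed=False is passed explicitly to avoid relying on the version's default.
sums = df2.groupby(level=0, observed=False).sum()
```

Here `sorted_labels` starts with `'c'` because `'c'` is the first category, which matches the sorted output shown earlier in this section.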

Sample

Series, DataFrames, and Panels now have a new method: sample(). The method accepts a specific number of rows or columns to return, or a fraction of the total number of rows or columns. It also has options for sampling with or without replacement, for passing in a column for weights for non-uniform sampling, and for setting seed values to facilitate replication. (GH2419)

In [1]: example_series = pd.Series([0, 1, 2, 3, 4, 5])

# When no arguments are passed, returns 1 row
In [2]: example_series.sample()
Out[2]: 
3    3
Length: 1, dtype: int64

# One may specify either a number of rows:
In [3]: example_series.sample(n=3)
Out[3]: 
2    2
1    1
0    0
Length: 3, dtype: int64

# Or a fraction of the rows:
In [4]: example_series.sample(frac=0.5)
Out[4]: 
1    1
5    5
3    3
Length: 3, dtype: int64

# weights are accepted.
In [5]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

In [6]: example_series.sample(n=3, weights=example_weights)
Out[6]: 
2    2
4    4
3    3
Length: 3, dtype: int64

# weights will also be normalized if they do not sum to one,
# and missing values will be treated as zeros.
In [7]: example_weights2 = [0.5, 0, 0, 0, None, np.nan]

In [8]: example_series.sample(n=1, weights=example_weights2)
Out[8]: 
0    0
Length: 1, dtype: int64

When applied to a DataFrame, one may pass the name of a column to specify sampling weights when sampling from rows.

In [9]: df = pd.DataFrame({"col1": [9, 8, 7, 6], "weight_column": [0.5, 0.4, 0.1, 0]})

In [10]: df.sample(n=3, weights="weight_column")
Out[10]: 
   col1  weight_column
0     9            0.5
1     8            0.4
2     7            0.1

[3 rows x 2 columns]
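
The transcripts above draw without a seed, so their output varies from run to run. As a hedged sketch of the seeding and replacement options mentioned in the description, the following uses the `random_state` and `replace` keywords of `sample()`; the seed values and the weights are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# The same seed yields the same draw, which helps when replicating results.
first = s.sample(n=3, random_state=42)
second = s.sample(n=3, random_state=42)

# frac is an alternative to n: half of six rows is three rows.
half = s.sample(frac=0.5, random_state=0)

# Sampling with replacement may return more rows than the Series holds.
oversampled = s.sample(n=10, replace=True, random_state=0)

# Rows with zero weight are never drawn (labels 0 and 1 here).
weights = [0, 0, 0.25, 0.25, 0.25, 0.25]
weighted = s.sample(n=2, weights=weights, random_state=0)
```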

String methods enhancements

Continuing from v0.16.0, the following enhancements make string operations easier and more consistent with standard python string operations.

  • Added StringMethods (.str accessor) to Index (GH9068)

    The .str accessor is now available for both Series and Index.

    In [11]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
    
    In [12]: idx.str.strip()
    Out[12]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
    

    One special case for the .str accessor on Index is that if a string method returns bool, the .str accessor will return a np.array instead of a boolean Index (GH8875). This enables the following expression to work naturally:

    In [13]: idx = pd.Index(["a1", "a2", "b1", "b2"])
    
    In [14]: s = pd.Series(range(4), index=idx)
    
    In [15]: s
    Out[15]: 
    a1    0
    a2    1
    b1    2
    b2    3
    Length: 4, dtype: int64
    
    In [16]: idx.str.startswith("a")
    Out[16]: array([ True,  True, False, False])
    
    In [17]: s[s.index.str.startswith("a")]
    Out[17]: 
    a1    0
    a2    1
    Length: 2, dtype: int64
    
  • The following new methods are accessible via the .str accessor to apply the function to each value. (GH9766, GH9773, GH10031, GH10045, GH10052)

    Methods

    capitalize()

    swapcase()

    normalize()

    partition()

    rpartition()

    index()

    rindex()

    translate()

  • split now takes expand keyword to specify whether to expand dimensionality. return_type is deprecated. (GH9847)

    In [18]: s = pd.Series(["a,b", "a,c", "b,c"])
    
    # return Series
    In [19]: s.str.split(",")
    Out[19]: 
    0    [a, b]
    1    [a, c]
    2    [b, c]
    Length: 3, dtype: object
    
    # return DataFrame
    In [20]: s.str.split(",", expand=True)
    Out[20]: 
       0  1
    0  a  b
    1  a  c
    2  b  c
    
    [3 rows x 2 columns]
    
    In [21]: idx = pd.Index(["a,b", "a,c", "b,c"])
    
    # return Index
    In [22]: idx.str.split(",")
    Out[22]: Index([['a', 'b'], ['a', 'c'], ['b', 'c']], dtype='object')
    
    # return MultiIndex
    In [23]: idx.str.split(",", expand=True)
    Out[23]: 
    MultiIndex([('a', 'b'),
                ('a', 'c'),
                ('b', 'c')],
               )
    
  • Improved extract and get_dummies methods for Index.str (GH9980)
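
The string enhancements listed above can be condensed into one runnable sketch. It assumes a reasonably recent pandas; the sample strings are taken from the transcripts above or made up for illustration.

```python
import pandas as pd

# .str works on an Index as well as on a Series.
idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
stripped = idx.str.strip().tolist()

# A boolean result from Index.str can be used directly as a mask.
s = pd.Series(range(4), index=pd.Index(["a1", "a2", "b1", "b2"]))
mask = s.index.str.startswith("a")
selected = s[mask].tolist()

# expand=True widens the result: Series -> DataFrame, Index -> MultiIndex.
pairs = pd.Series(["a,b", "a,c", "b,c"])
frame = pairs.str.split(",", expand=True)
multi = pd.Index(["a,b", "a,c", "b,c"]).str.split(",", expand=True)

# Two of the newly added element-wise methods.
capitalized = pd.Series(["jack", "jILL"]).str.capitalize().tolist()
swapped = pd.Series(["Jack"]).str.swapcase().tolist()
```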

Other enhancements

  • BusinessHour offset is now supported, which represents business hours from 09:00 to 17:00 on BusinessDay by default. See here for details. (GH7905)

    In [24]: pd.Timestamp("2014-08-01 09:00") + pd.tseries.offsets.BusinessHour()
    Out[24]: Timestamp('2014-08-01 10:00:00')
    
    In [25]: pd.Timestamp("2014-08-01 07:00") + pd.tseries.offsets.BusinessHour()
    Out[25]: Timestamp('2014-08-01 10:00:00')
    
    In [26]: pd.Timestamp("2014-08-01 16:30") + pd.tseries.offsets.BusinessHour()
    Out[26]: Timestamp('2014-08-04 09:30:00')
    
  • DataFrame.diff now takes an axis parameter that determines the direction of differencing (GH9727)

  • Allow clip, clip_lower, and clip_upper to accept array-like arguments as thresholds (This is a regression from 0.11.0). These methods now have an axis parameter which determines how the Series or DataFrame will be aligned with the threshold(s). (GH6966)

  • DataFrame.mask() and Series.mask() now support same keywords as where (GH8801)

  • drop function can now accept an errors keyword to suppress the ValueError raised when any label does not exist in the target data. (GH6736)

    In [27]: df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])
    
    In [28]: df.drop(["A", "X"], axis=1, errors="ignore")
    Out[28]: 
              B         C
    0 -0.706771 -1.039575
    1 -0.424972  0.567020
    2 -1.087401 -0.673690
    
    [3 rows x 2 columns]
    
  • Add support for separating years and quarters using dashes, for example 2014-Q1. (GH9688)

  • Allow conversion of values with dtype datetime64 or timedelta64 to strings using astype(str) (GH9757)

  • get_dummies function now accepts a sparse keyword. If set to True, the returned DataFrame is sparse, e.g. SparseDataFrame. (GH8823)

  • Period now accepts datetime64 as value input. (GH9054)

  • Allow timedelta string conversion when the leading zero is missing from the time definition, i.e. 0:00:00 vs 00:00:00. (GH9570)

  • Allow Panel.shift with axis='items' (GH9890)

  • Trying to write an excel file now raises NotImplementedError if the DataFrame has a MultiIndex instead of writing a broken Excel file. (GH9794)

  • Allow Categorical.add_categories to accept Series or np.array. (GH9927)

  • Add/delete str/dt/cat accessors dynamically from __dir__. (GH9910)

  • Add normalize as a dt accessor method. (GH10047)

  • DataFrame and Series now have _constructor_expanddim property as overridable constructor for one higher dimensionality data. This should be used only when it is really needed, see here

  • pd.lib.infer_dtype now returns 'bytes' in Python 3 where appropriate. (GH10032)
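
Three of the enhancements in this list (the BusinessHour offset, the errors keyword of drop, and the axis parameter of diff) can be checked with a short sketch. It assumes a recent pandas, where the offset is reachable as `pd.offsets.BusinessHour`; the small DataFrame is made up for illustration.

```python
import pandas as pd

# BusinessHour: 2014-08-01 is a Friday, so adding one business hour at 16:30
# uses the last 30 minutes of Friday and the first 30 minutes of Monday.
rolled = pd.Timestamp("2014-08-01 16:30") + pd.offsets.BusinessHour()

# drop(errors="ignore") skips labels that are not present ("X" here).
df = pd.DataFrame({"A": [1, 2], "B": [4, 8], "C": [9, 27]})
remaining = df.drop(["A", "X"], axis=1, errors="ignore").columns.tolist()

# Without errors="ignore" a missing label raises. The exception type is
# caught broadly because it may differ between pandas versions.
try:
    df.drop(["X"], axis=1)
    raised = False
except (KeyError, ValueError):
    raised = True

# diff(axis=1) takes differences across columns instead of down rows,
# so the first column has nothing to subtract and is all NaN.
col_diff = df.diff(axis=1)
```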

API changes

  • When passing in an ax to df.plot( ..., ax=ax), the sharex kwarg will now default to False. The result is that the visibility of xlabels and xticklabels will no longer be changed. You have to do that yourself for the right axes in your figure or set sharex=True explicitly (but this changes the visibility for all axes in the figure, not only the one which is passed in!). If pandas creates the subplots itself (e.g. no passed in ax kwarg), then the default is still sharex=True and the visibility changes are applied.

  • assign() now inserts new columns in alphabetical order. Previously the order was arbitrary. (GH9777)

  • By default, read_csv and read_table will now try to infer the compression type based on the file extension. Set compression=None to restore the previous behavior (no decompression). (GH9770)
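
The compression inference described in the last item can be illustrated with a round trip through a temporary gzip file. This is a minimal sketch: the file name `example.csv.gz` and the sample data are hypothetical, and only the `.gz` extension matters for the inference.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the .gz suffix is what read_csv inspects.
    path = os.path.join(tmp, "example.csv.gz")
    df.to_csv(path, index=False, compression="gzip")

    # No compression argument is given: it is inferred from the extension.
    roundtrip = pd.read_csv(path)
```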

Deprecations

  • Series.str.split’s return_type keyword is deprecated in favor of expand (GH9847)

Index representation

The string representation of Index and its subclasses has now been unified. These will show a single-line display if there are few values; a wrapped multi-line display for a lot of values (but fewer than display.max_seq_items); and a truncated display (the head and tail of the data) if there are lots of items (> display.max_seq_items). The formatting for MultiIndex is unchanged (a multi-line wrapped display). The display width responds to the option display.max_seq_items, which defaults to 100. (GH6482)

Previous behavior

In [2]: pd.Index(range(4), name='foo')
Out[2]: Int64Index([0, 1, 2, 3], dtype='int64')

In [3]: pd.Index(range(104), name='foo')
Out[3]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, ...], dtype='int64')

In [4]: pd.date_range('20130101', periods=4, name='foo', tz='US/Eastern')
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-01-04 00:00:00-05:00]
Length: 4, Freq: D, Timezone: US/Eastern

In [5]: pd.date_range('20130101', periods=104, name='foo', tz='US/Eastern')
Out[5]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00-05:00, ..., 2013-04-14 00:00:00-04:00]
Length: 104, Freq: D, Timezone: US/Eastern

New behavior

In [29]: pd.set_option("display.width", 80)

In [30]: pd.Index(range(4), name="foo")
Out[30]: RangeIndex(start=0, stop=4, step=1, name='foo')

In [31]: pd.Index(range(30), name="foo")
Out[31]: RangeIndex(start=0, stop=30, step=1, name='foo')

In [32]: pd.Index(range(104), name="foo")
Out[32]: RangeIndex(start=0, stop=104, step=1, name='foo')

In [33]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"], ordered=True, name="foobar")
Out[33]: CategoricalIndex(['a', 'bb', 'ccc', 'dddd'], categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [34]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 10, ordered=True, name="foobar")
Out[34]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
                  'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb',
                  'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
                  'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd',
                  'a', 'bb', 'ccc', 'dddd'],
                 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar')

In [35]: pd.CategoricalIndex(["a", "bb", "ccc", "dddd"] * 100, ordered=True, name="foobar")
Out[35]: 
CategoricalIndex(['a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a',
                  'bb',
                  ...
                  'ccc', 'dddd', 'a', 'bb', 'ccc', 'dddd', 'a', 'bb', 'ccc',
                  'dddd'],
                 categories=['a', 'bb', 'ccc', 'dddd'], ordered=True, dtype='category', name='foobar', length=400)

In [36]: pd.date_range("20130101", periods=4, name="foo", tz="US/Eastern")
Out[36]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', name='foo', freq='D')

In [37]: pd.date_range("20130101", periods=25, freq="D")
Out[37]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06', '2013-01-07', '2013-01-08',
               '2013-01-09', '2013-01-10', '2013-01-11', '2013-01-12',
               '2013-01-13', '2013-01-14', '2013-01-15', '2013-01-16',
               '2013-01-17', '2013-01-18', '2013-01-19', '2013-01-20',
               '2013-01-21', '2013-01-22', '2013-01-23', '2013-01-24',
               '2013-01-25'],
              dtype='datetime64[ns]', freq='D')

In [38]: pd.date_range("20130101", periods=104, name="foo", tz="US/Eastern")
Out[38]: 
DatetimeIndex(['2013-01-01 00:00:00-05:00', '2013-01-02 00:00:00-05:00',
               '2013-01-03 00:00:00-05:00', '2013-01-04 00:00:00-05:00',
               '2013-01-05 00:00:00-05:00', '2013-01-06 00:00:00-05:00',
               '2013-01-07 00:00:00-05:00', '2013-01-08 00:00:00-05:00',
               '2013-01-09 00:00:00-05:00', '2013-01-10 00:00:00-05:00',
               ...
               '2013-04-05 00:00:00-04:00', '2013-04-06 00:00:00-04:00',
               '2013-04-07 00:00:00-04:00', '2013-04-08 00:00:00-04:00',
               '2013-04-09 00:00:00-04:00', '2013-04-10 00:00:00-04:00',
               '2013-04-11 00:00:00-04:00', '2013-04-12 00:00:00-04:00',
               '2013-04-13 00:00:00-04:00', '2013-04-14 00:00:00-04:00'],
              dtype='datetime64[ns, US/Eastern]', name='foo', length=104, freq='D')
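
The switch between the full and the truncated display can be reproduced on a small index by lowering display.max_seq_items temporarily. This is a sketch assuming a recent pandas; the `item…` labels are made up, and string labels are used so that the index is not shown in the compact RangeIndex form.

```python
import pandas as pd

labels = [f"item{i}" for i in range(30)]
idx = pd.Index(labels, name="foo")

# With the default threshold (100) every value is shown.
full = repr(idx)

# Lowering display.max_seq_items below len(idx) switches to the truncated
# head/tail display, which also reports the total length.
with pd.option_context("display.max_seq_items", 10):
    truncated = repr(idx)
```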

Performance improvements

  • Improved csv write performance with mixed dtypes, including datetimes by up to 5x (GH9940)

  • Improved csv write performance generally by 2x (GH9940)

  • Improved the performance of pd.lib.max_len_string_array by 5-7x (GH10024)

Bug fixes

  • Bug where labels did not appear properly in the legend of DataFrame.plot(). Passing label= arguments now works, and Series indices are no longer mutated. (GH9542)

  • Bug in json serialization causing a segfault when a frame had zero length. (GH9805)

  • Bug in read_csv where missing trailing delimiters would cause segfault. (GH5664)

  • Bug in retaining index name on appending (GH9862)

  • Bug in scatter_matrix draws unexpected axis ticklabels (GH5662)

  • Fixed bug in StataWriter resulting in changes to input DataFrame upon save (GH9795).

  • Bug in transform causing length mismatch when null entries were present and a fast aggregator was being used (GH9697)

  • Bug in equals causing false negatives when block order differed (GH9330)

  • Bug in grouping with multiple pd.Grouper where one is non-time based (GH10063)

  • Bug in read_sql_table error when reading postgres table with timezone (GH7139)

  • Bug in DataFrame slicing may not retain metadata (GH9776)

  • Bug where TimedeltaIndex was not properly serialized in fixed HDFStore (GH9635)

  • Bug with TimedeltaIndex constructor ignoring name when given another TimedeltaIndex as data (GH10025).

  • Bug in DataFrameFormatter._get_formatted_index with not applying max_colwidth to the DataFrame index (GH7856)

  • Bug in .loc with a read-only ndarray data source (GH10043)

  • Bug in groupby.apply() that would raise if a passed user-defined function returned only None (for all input). (GH9685)

  • Always use temporary files in pytables tests (GH9992)

  • Bug in plotting continuously using secondary_y may not show legend properly. (GH9610, GH9779)

  • Bug in DataFrame.plot(kind="hist") results in TypeError when DataFrame contains non-numeric columns (GH9853)

  • Bug where repeated plotting of DataFrame with a DatetimeIndex may raise TypeError (GH9852)

  • Bug in setup.py that would allow an incompatible Cython version to build (GH9827)

  • Bug in plotting secondary_y incorrectly attaches right_ax property to secondary axes specifying itself recursively. (GH9861)

  • Bug in Series.quantile on empty Series of type Datetime or Timedelta (GH9675)

  • Bug in where causing incorrect results when upcasting was required (GH9731)

  • Bug in FloatArrayFormatter where decision boundary for displaying “small” floats in decimal format is off by one order of magnitude for a given display.precision (GH9764)

  • Fixed bug where DataFrame.plot() raised an error when both color and style keywords were passed and there was no color symbol in the style strings (GH9671)

  • Not showing a DeprecationWarning on combining list-likes with an Index (GH10083)

  • Bug in read_csv and read_table when using skip_rows parameter if blank lines are present. (GH9832)

  • Bug in read_csv() interprets index_col=True as 1 (GH9798)

  • Bug in index equality comparisons using == failing on Index/MultiIndex type incompatibility (GH9785)

  • Bug in which SparseDataFrame could not take nan as a column name (GH8822)

  • Bug in to_msgpack and read_msgpack zlib and blosc compression support (GH9783)

  • Bug where GroupBy.size doesn’t attach the index name properly if grouped by TimeGrouper (GH9925)

  • Bug causing an exception in slice assignments because length_of_indexer returns wrong results (GH9995)

  • Bug in csv parser causing lines with initial white space plus one non-space character to be skipped. (GH9710)

  • Bug in C csv parser causing spurious NaNs when data started with newline followed by white space. (GH10022)

  • Bug causing elements with a null group to spill into the final group when grouping by a Categorical (GH9603)

  • Bug where .iloc and .loc behavior is not consistent on empty dataframes (GH9964)

  • Bug in invalid attribute access on a TimedeltaIndex incorrectly raised ValueError instead of AttributeError (GH9680)

  • Bug in unequal comparisons between categorical data and a scalar which was not in the categories (e.g. Series(Categorical(list("abc"), ordered=True)) > "d"). This returned False for all elements, but now raises a TypeError. Equality comparisons also now return False for == and True for !=. (GH9848)

  • Bug in DataFrame __setitem__ when right hand side is a dictionary (GH9874)

  • Bug in where when dtype is datetime64/timedelta64, but dtype of other is not (GH9804)

  • Bug in MultiIndex.sortlevel() where unicode level names break (GH9856)

  • Bug in which groupby.transform incorrectly enforced output dtypes to match input dtypes. (GH9807)

  • Bug in DataFrame constructor when columns parameter is set, and data is an empty list (GH9939)

  • Bug in bar plot with log=True raises TypeError if all values are less than 1 (GH9905)

  • Bug in horizontal bar plot ignores log=True (GH9905)

  • Bug in PyTables queries that did not return proper results using the index (GH8265, GH9676)

  • Bug where dividing a dataframe containing values of type Decimal by another Decimal would raise. (GH9787)

  • Bug where using DataFrame.asfreq would remove the name of the index. (GH9885)

  • Bug causing extra index point when resample BM/BQ (GH9756)

  • Changed caching in AbstractHolidayCalendar to be at the instance level rather than at the class level as the latter can result in unexpected behaviour. (GH9552)

  • Fixed latex output for MultiIndexed dataframes (GH9778)

  • Bug causing an exception when setting an empty range using DataFrame.loc (GH9596)

  • Bug in hiding ticklabels with subplots and shared axes when adding a new plot to an existing grid of axes (GH9158)

  • Bug in transform and filter when grouping on a categorical variable (GH9921)

  • Bug in transform when groups are equal in number and dtype to the input index (GH9700)

  • Google BigQuery connector now imports dependencies on a per-method basis. (GH9713)

  • Updated BigQuery connector to no longer use deprecated oauth2client.tools.run() (GH8327)

  • Bug in subclassed DataFrame. It may not return the correct class, when slicing or subsetting it. (GH9632)

  • Bug in .median() where non-float null values are not handled correctly (GH10040)

  • Bug in Series.fillna() where it raises if a numerically convertible string is given (GH10092)
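
The categorical comparison fix (GH9848) listed above changes observable behavior, so a short check may be useful. The sketch below assumes a recent pandas and uses the same ordered categorical as the bug description.

```python
import pandas as pd

cat = pd.Series(pd.Categorical(list("abc"), ordered=True))

# An ordering comparison against a value outside the categories raises.
try:
    cat > "d"
    raised_type_error = False
except TypeError:
    raised_type_error = True

# Equality comparisons are still allowed and answer element-wise.
equal = (cat == "d").tolist()
not_equal = (cat != "d").tolist()
```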

Contributors

For contributors, please see /usr/share/doc/contributors_list.txt or https://github.com/pandas-dev/pandas/graphs/contributors