What’s new in 1.2.0 (December 26, 2020)

These are the changes in pandas 1.2.0. See Release notes for a full changelog including other versions of pandas.

Warning

The xlwt package for writing old-style .xls excel files is no longer maintained. The xlrd package is now only for reading old-style .xls files.

Previously, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. If openpyxl is installed, many of these cases will now default to using the openpyxl engine. See the read_excel() documentation for more details.

Thus, it is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using xlrd to read .xlsx files; this is no longer supported, so switch to using openpyxl instead.
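For example, the engine choice can be made explicit when reading a 2007+ file; the file name here is only illustrative:

import pandas as pd

# Request the openpyxl engine explicitly; openpyxl must be installed.
df = pd.read_excel("data.xlsx", engine="openpyxl")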

Attempting to use the xlwt engine will raise a FutureWarning unless the option io.excel.xls.writer is set to "xlwt". While this option is now deprecated and will also raise a FutureWarning, it can be globally set and the warning suppressed. Users are recommended to write .xlsx files using the openpyxl engine instead.
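As a sketch, opting back in to xlwt while suppressing the option's own FutureWarning could look like the following (not recommended for new code):

import warnings

import pandas as pd

# Setting the deprecated option itself raises a FutureWarning,
# so suppress it while opting in.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    pd.set_option("io.excel.xls.writer", "xlwt")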

Enhancements

Optionally disallow duplicate labels

Series and DataFrame can now be created with the allows_duplicate_labels=False flag to control whether the index or columns can contain duplicate labels (GH28394). This can be used to prevent accidental introduction of duplicate labels, which can affect downstream operations.

By default, duplicates continue to be allowed.

In [1]: pd.Series([1, 2], index=['a', 'a'])
Out[1]:
a    1
a    2
Length: 2, dtype: int64

In [2]: pd.Series([1, 2], index=['a', 'a']).set_flags(allows_duplicate_labels=False)
...
DuplicateLabelError: Index has duplicates.
      positions
label
a        [0, 1]

pandas will propagate the allows_duplicate_labels property through many operations.

In [3]: a = (
   ...:     pd.Series([1, 2], index=['a', 'b'])
   ...:       .set_flags(allows_duplicate_labels=False)
   ...: )

In [4]: a
Out[4]:
a    1
b    2
Length: 2, dtype: int64

# An operation introducing duplicates
In [5]: a.reindex(['a', 'b', 'a'])
...
DuplicateLabelError: Index has duplicates.
      positions
label
a        [0, 2]


Warning

This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.

See Duplicate Labels for more.

The allows_duplicate_labels flag is stored in the new DataFrame.flags attribute. This stores global attributes that apply to the pandas object. This differs from DataFrame.attrs, which stores information that applies to the dataset.
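A minimal sketch contrasting the two (the "source" key is only illustrative):

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# flags hold properties of the pandas object itself
print(df.flags.allows_duplicate_labels)  # True by default
df = df.set_flags(allows_duplicate_labels=False)

# attrs hold free-form metadata about the dataset
df.attrs["source"] = "survey-2020"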

Passing arguments to fsspec backends

Many read/write functions have acquired the storage_options optional argument, to pass a dictionary of parameters to the storage backend. This allows, for example, passing credentials to S3 and GCS storage. The details of what parameters can be passed to which backends can be found in the documentation of the individual storage backends (detailed from the fsspec docs for builtin implementations and linked to external ones). See the section Reading/writing remote files.

GH35655 added fsspec support (including storage_options) for reading Excel files.
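For example, anonymous access to a public S3 bucket might look like the following sketch; the bucket and key are hypothetical, "anon" is an option understood by the s3fs backend, and the s3fs optional dependency must be installed:

import pandas as pd

df = pd.read_csv(
    "s3://my-public-bucket/data.csv",
    storage_options={"anon": True},
)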

Support for binary file handles in to_csv

to_csv() supports file handles in binary mode (GH19827 and GH35058) with encoding (GH13068 and GH23854) and compression (GH22555). If pandas does not automatically detect whether the file handle is opened in binary or text mode, it is necessary to provide mode="wb".

For example:

In [1]: import io

In [2]: data = pd.DataFrame([0, 1, 2])

In [3]: buffer = io.BytesIO()

In [4]: data.to_csv(buffer, encoding="utf-8", compression="gzip")
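When passing a file handle whose mode pandas cannot inspect, mode="wb" makes the binary mode explicit. A sketch with an ordinary file object (the file name is illustrative):

import pandas as pd

data = pd.DataFrame([0, 1, 2])

# The handle is opened in binary mode, so tell to_csv as much.
with open("out.csv.gz", "wb") as handle:
    data.to_csv(handle, mode="wb", encoding="utf-8", compression="gzip")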

Support for short caption and table position in to_latex

DataFrame.to_latex() now allows one to specify a floating table position (GH35281) and a short caption (GH36267).

The keyword position has been added to set the position.

In [5]: data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

In [6]: table = data.to_latex(position='ht')

In [7]: print(table)
\begin{table}[ht]
\centering
\begin{tabular}{lrr}
\toprule
{} &  a &  b \\
\midrule
0 &  1 &  3 \\
1 &  2 &  4 \\
\bottomrule
\end{tabular}
\end{table}

Usage of the keyword caption has been extended. Besides taking a single string as an argument, one can optionally provide a tuple (full_caption, short_caption) to add a short caption macro.

In [8]: data = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

In [9]: table = data.to_latex(caption=('the full long caption', 'short caption'))

In [10]: print(table)
\begin{table}
\centering
\caption[short caption]{the full long caption}
\begin{tabular}{lrr}
\toprule
{} &  a &  b \\
\midrule
0 &  1 &  3 \\
1 &  2 &  4 \\
\bottomrule
\end{tabular}
\end{table}

Change in default floating precision for read_csv and read_table

For the C parsing engine, the methods read_csv() and read_table() previously defaulted to a parser that could read floating point numbers slightly incorrectly with respect to the last bit in precision. The option floating_precision="high" has always been available to avoid this issue. Beginning with this version, the default is now to use the more accurate parser by making floating_precision=None correspond to the high precision parser, and the new option floating_precision="legacy" to use the legacy parser. The change to using the higher precision parser by default should have no impact on performance. (GH17154)
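A sketch of the two settings side by side; the difference only shows up in the last bits of hard-to-parse values:

import io

import pandas as pd

csv = "x\n0.3066101993807095471566981359501369297504425048828125"

# floating_precision=None now selects the high-precision parser.
pd.read_csv(io.StringIO(csv))

# The previous behaviour remains available as "legacy".
pd.read_csv(io.StringIO(csv), floating_precision="legacy")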

Experimental nullable data types for float data

We’ve added Float32Dtype / Float64Dtype and FloatingArray. These are extension data types dedicated to floating point data that can hold the pd.NA missing value indicator (GH32265, GH34307).

While the default float data type already supports missing values using np.nan, these new data types use pd.NA (and its corresponding behavior) as the missing value indicator, in line with the already existing nullable integer and boolean data types.

One example where the behavior of np.nan and pd.NA is different is comparison operations:

# the default NumPy float64 dtype
In [11]: s1 = pd.Series([1.5, None])

In [12]: s1
Out[12]: 
0    1.5
1    NaN
dtype: float64

In [13]: s1 > 1
Out[13]: 
0     True
1    False
dtype: bool

# the new nullable float64 dtype
In [14]: s2 = pd.Series([1.5, None], dtype="Float64")

In [15]: s2
Out[15]: 
0     1.5
1    <NA>
dtype: Float64

In [16]: s2 > 1
Out[16]: 
0    True
1    <NA>
dtype: boolean

See the Experimental NA scalar to denote missing values doc section for more details on the behavior when using the pd.NA missing value indicator.

As shown above, the dtype can be specified using the “Float64” or “Float32” string (capitalized to distinguish it from the default “float64” data type). Alternatively, you can also use the dtype object:

In [17]: pd.Series([1.5, None], dtype=pd.Float32Dtype())
Out[17]: 
0     1.5
1    <NA>
dtype: Float32

Operations with the existing integer or boolean nullable data types that give float results will now also use the nullable floating data types (GH38178).
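For example (a small sketch):

import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Dividing a nullable integer Series now yields Float64, not float64.
print((s / 2).dtype)  # Float64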

Warning

Experimental: the new floating data types are currently experimental, and their behavior or API may still change without warning. Especially the behavior regarding NaN (distinct from NA missing values) is subject to change.

Index/column name preservation when aggregating

When aggregating using concat() or the DataFrame constructor, pandas will now attempt to preserve index and column names whenever possible (GH35847). In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. Here is an example where the index name is preserved:

In [18]: idx = pd.Index(range(5), name='abc')

In [19]: ser = pd.Series(range(5, 10), index=idx)

In [20]: pd.concat({'x': ser[1:], 'y': ser[:-1]}, axis=1)
Out[20]: 
       x    y
abc          
1    6.0  6.0
2    7.0  7.0
3    8.0  8.0
4    9.0  NaN
0    NaN  5.0

The same is true for MultiIndex, but the logic is applied separately on a level-by-level basis.
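A sketch of the level-by-level logic, assuming two MultiIndexes that agree on the first level name but not the second:

import pandas as pd

idx1 = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["outer", "inner"])
idx2 = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["outer", "other"])

s1 = pd.Series(range(4), index=idx1)
s2 = pd.Series(range(4), index=idx2)

# "outer" is shared and kept; the second level names disagree,
# so that level of the result is unnamed.
print(pd.concat([s1, s2]).index.names)  # FrozenList(['outer', None])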

GroupBy supports EWM operations directly

DataFrameGroupBy now supports exponentially weighted window operations directly (GH16037).

In [21]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': range(4)})

In [22]: df
Out[22]: 
   A  B
0  a  0
1  b  1
2  a  2
3  b  3

In [23]: df.groupby('A').ewm(com=1.0).mean()
Out[23]: 
            B
A            
a 0  0.000000
  2  1.333333
b 1  1.000000
  3  2.333333

Additionally, mean supports execution via Numba with the engine and engine_kwargs arguments. Numba must be installed as an optional dependency to use this feature.
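A sketch of the Numba path, assuming numba is installed:

import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': range(4)})

# Execute the EWM mean through Numba instead of Cython.
df.groupby('A').ewm(com=1.0).mean(
    engine='numba',
    engine_kwargs={'nopython': True},
)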

Other enhancements

Notable bug fixes

These are bug fixes that might have notable behavior changes.

Consistency of DataFrame Reductions

DataFrame.any() and DataFrame.all() with bool_only=True now determine whether to exclude object-dtype columns on a column-by-column basis, instead of checking whether all object-dtype columns can be considered boolean.

This prevents pathological behavior where applying the reduction on a subset of columns could result in a larger Series result (GH37799).

In [24]: df = pd.DataFrame({"A": ["foo", "bar"], "B": [True, False]}, dtype=object)

In [25]: df["C"] = pd.Series([True, True])

Previous behavior:

In [5]: df.all(bool_only=True)
Out[5]:
C    True
dtype: bool

In [6]: df[["B", "C"]].all(bool_only=True)
Out[6]:
B    False
C    True
dtype: bool

New behavior:

In [26]: df.all(bool_only=True)
Out[26]: 
B    False
C     True
dtype: bool

In [27]: df[["B", "C"]].all(bool_only=True)
Out[27]: 
B    False
C     True
dtype: bool

Other DataFrame reductions with numeric_only=None will also avoid this pathological behavior (GH37827):

In [28]: df = pd.DataFrame({"A": [0, 1, 2], "B": ["a", "b", "c"]}, dtype=object)

Previous behavior:

In [3]: df.mean()
Out[3]: Series([], dtype: float64)

In [4]: df[["A"]].mean()
Out[4]:
A    1.0
dtype: float64

New behavior:

In [29]: df.mean()
Out[29]: 
A    1.0
dtype: float64

In [30]: df[["A"]].mean()
Out[30]: 
A    1.0
dtype: float64

Moreover, DataFrame reductions with numeric_only=None will now be consistent with their Series counterparts. In particular, for reductions where the Series method raises TypeError, the DataFrame reduction will now consider that column non-numeric instead of casting to a NumPy array which may have different semantics (GH36076, GH28949, GH21020).

In [31]: ser = pd.Series([0, 1], dtype="category", name="A")

In [32]: df = ser.to_frame()

Previous behavior:

In [5]: df.any()
Out[5]:
A    True
dtype: bool

New behavior:

In [33]: df.any()
Out[33]: Series([], dtype: bool)

Increased minimum version for Python

pandas 1.2.0 supports Python 3.7.1 and higher (GH35214).

Increased minimum versions for dependencies

Some minimum supported versions of dependencies were updated (GH35214). If installed, we now require:

Package          Minimum Version   Required   Changed
---------------  ----------------  ---------  --------
numpy            1.16.5            X          X
pytz             2017.3            X          X
python-dateutil  2.7.3             X
bottleneck       1.2.1
numexpr          2.6.8                        X
pytest (dev)     5.0.1                        X
mypy (dev)       0.782                        X

For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package         Minimum Version   Changed
--------------  ----------------  --------
beautifulsoup4  4.6.0
fastparquet     0.3.2
fsspec          0.7.4
gcsfs           0.6.0
lxml            4.3.0             X
matplotlib      2.2.3             X
numba           0.46.0
openpyxl        2.6.0             X
pyarrow         0.15.0            X
pymysql         0.7.11            X
pytables        3.5.1             X
s3fs            0.4.0
scipy           1.2.0
sqlalchemy      1.2.8             X
xarray          0.12.3            X
xlrd            1.2.0             X
xlsxwriter      1.0.2             X
xlwt            1.3.0             X
pandas-gbq      0.12.0

See Dependencies and Optional dependencies for more.

Other API changes

  • Sorting in descending order is now stable for Series.sort_values() and Index.sort_values() for Datetime-like Index subclasses. This will affect sort order when sorting a DataFrame on multiple columns, sorting with a key function that produces duplicates, or requesting the sorting index when using Index.sort_values(). When using Series.value_counts(), the count of missing values is no longer necessarily last in the list of duplicate counts; instead, its position corresponds to the position in the original Series. When using Index.sort_values() for Datetime-like Index subclasses, NaT previously ignored the na_position argument and was sorted to the beginning. It now respects na_position, the default being last, the same as other Index subclasses (GH35992); a sketch of the new behavior follows this list.

  • Passing an invalid fill_value to Categorical.take(), DatetimeArray.take(), TimedeltaArray.take(), or PeriodArray.take() now raises a TypeError instead of a ValueError (GH37733)

  • Passing an invalid fill_value to Series.shift() with a CategoricalDtype now raises a TypeError instead of a ValueError (GH37733)

  • Passing an invalid value to IntervalIndex.insert() or CategoricalIndex.insert() now raises a TypeError instead of a ValueError (GH37733)

  • Attempting to reindex a Series with a CategoricalIndex with an invalid fill_value now raises a TypeError instead of a ValueError (GH37733)

  • CategoricalIndex.append() with an index that contains non-category values will now cast instead of raising TypeError (GH38098)
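A sketch of the new NaT handling mentioned in the first item above:

import pandas as pd

dti = pd.DatetimeIndex(["2020-01-02", pd.NaT, "2020-01-01"])

# NaT now respects na_position; the default is "last".
print(dti.sort_values())                     # NaT sorted to the end
print(dti.sort_values(na_position="first"))  # NaT sorted to the front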

Deprecations

Calling NumPy ufuncs on non-aligned DataFrames

Calling NumPy ufuncs on non-aligned DataFrames changed behaviour in pandas 1.2.0 (to align the inputs before calling the ufunc), but this change is reverted in pandas 1.2.1. The behaviour of not aligning is now deprecated instead; see the 1.2.1 release notes for more details.
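To keep the aligned behaviour and avoid the deprecation warning, the inputs can be aligned explicitly before calling the ufunc, as in this sketch:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({"a": [3, 4]}, index=[1, 2])

# Align explicitly, then apply the ufunc to the aligned frames.
df1, df2 = df1.align(df2)
result = np.add(df1, df2)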

Performance improvements

  • Performance improvements when creating a DataFrame or Series with dtype str or StringDtype from an array with many string elements (GH36304, GH36317, GH36325, GH36432, GH37371)

  • Performance improvement in GroupBy.agg() with the numba engine (GH35759)

  • Performance improvement in Series.map() when using a huge dictionary as the mapper (GH34717)

  • Performance improvement in GroupBy.transform() with the numba engine (GH36240)

  • The Styler uuid was shortened to compress data transmitted over the web while maintaining a reasonably low probability of table collisions (GH36345)

  • Performance improvement in to_datetime() with non-ns time unit for float dtype columns (GH20445)

  • Performance improvement in setting values on an IntervalArray (GH36310)

  • The internal index method _shallow_copy() now makes the new index and original index share cached attributes, avoiding creating these again, if created on either. This can speed up operations that depend on creating copies of existing indexes (GH36840)

  • Performance improvement in RollingGroupby.count() (GH35625)

  • Small performance decrease to Rolling.min() and Rolling.max() for fixed windows (GH36567)

  • Reduced peak memory usage in DataFrame.to_pickle() when using protocol=5 in Python 3.8+ (GH34244)

  • Faster dir calls when the object has many index labels, e.g. dir(ser) (GH37450)

  • Performance improvement in ExpandingGroupby (GH37064)

  • Performance improvement in Series.astype() and DataFrame.astype() for Categorical (GH8628)

  • Performance improvement in DataFrame.groupby() for float dtype (GH28303); changes to the underlying hash function can lead to changes in the sort ordering of ties in float-based indexes (e.g. Index.value_counts())

  • Performance improvement in pd.isin() for inputs with more than 1e6 elements (GH36611)

  • Performance improvement for DataFrame.__setitem__() with list-like indexers (GH37954)

  • read_json() now avoids reading the entire file into memory when chunksize is specified (GH34548); a sketch of chunked reading follows this list
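A sketch of chunked reading; the file name is hypothetical and lines=True expects line-delimited JSON:

import pandas as pd

# chunksize returns an iterator, so the file is consumed lazily.
reader = pd.read_json("records.jsonl", lines=True, chunksize=1000)
for chunk in reader:
    print(chunk.shape)  # placeholder for real per-chunk processing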

Bug fixes

Categorical

  • Categorical.fillna() will always return a copy, validate a passed fill value regardless of whether there are any NAs to fill, and disallow NaT as a fill value for numeric categories (GH36530)

  • Bug in Categorical.__setitem__() that incorrectly raised when trying to set a tuple value (GH20439)

  • Bug in CategoricalIndex.equals() incorrectly casting non-category entries to np.nan (GH37667)

  • Bug in CategoricalIndex.where() incorrectly setting non-category entries to np.nan instead of raising TypeError (GH37977)

  • Bug in Categorical.to_numpy() and np.array(categorical) with tz-aware datetime64 categories incorrectly dropping the time zone information instead of casting to object dtype (GH38136)

Datetime-like

  • Bug in DataFrame.combine_first() that would convert datetime-like column on other DataFrame to integer when the column is not present in original DataFrame (GH28481)

  • Bug in DatetimeArray.date where a ValueError would be raised with a read-only backing array (GH33530)

  • Bug in NaT comparisons failing to raise TypeError on invalid inequality comparisons (GH35046)

  • Bug in DateOffset where attributes reconstructed from pickle files differ from original objects when input values exceed normal ranges (e.g. months=12) (GH34511)

  • Bug in DatetimeIndex.get_slice_bound() where datetime.date objects were not accepted, nor were naive Timestamp objects with a tz-aware DatetimeIndex (GH35690)

  • Bug in DatetimeIndex.slice_locs() where datetime.date objects were not accepted (GH34077)

  • Bug in DatetimeIndex.searchsorted(), TimedeltaIndex.searchsorted(), PeriodIndex.searchsorted(), and Series.searchsorted() with datetime64, timedelta64 or Period dtype placement of NaT values being inconsistent with NumPy (GH36176, GH36254)

  • Inconsistency in the DatetimeArray, TimedeltaArray, and PeriodArray __setitem__ method, which cast arrays of strings to datetime-like scalars but not scalar strings (GH36261)

  • Bug in DatetimeArray.take() incorrectly allowing fill_value with a mismatched time zone (GH37356)

  • Bug in DatetimeIndex.shift incorrectly raising when shifting empty indexes (GH14811)

  • Timestamp and DatetimeIndex comparisons between tz-aware and tz-naive objects now follow the standard library datetime behavior, returning True/False for !=/== and raising for inequality comparisons (GH28507)

  • Bug in DatetimeIndex.equals() and TimedeltaIndex.equals() incorrectly considering int64 indexes as equal (GH36744)

  • Series.to_json(), DataFrame.to_json(), and read_json() now implement time zone parsing when orient structure is table (GH35973)

  • astype() now attempts to convert to datetime64[ns, tz] directly from object with inferred time zone from string (GH35973)

  • Bug in TimedeltaIndex.sum() and Series.sum() with timedelta64 dtype on an empty index or series returning NaT instead of Timedelta(0) (GH31751)

  • Bug in DatetimeArray.shift() incorrectly allowing fill_value with a mismatched time zone (GH37299)

  • Bug in adding a BusinessDay with nonzero offset to a non-scalar other (GH37457)

  • Bug in to_datetime() with a read-only array incorrectly raising (GH34857)

  • Bug in Series.isin() with datetime64[ns] dtype and DatetimeIndex.isin() incorrectly casting integers to datetimes (GH36621)

  • Bug in Series.isin() with datetime64[ns] dtype and DatetimeIndex.isin() failing to consider tz-aware and tz-naive datetimes as always different (GH35728)

  • Bug in Series.isin() with PeriodDtype dtype and PeriodIndex.isin() failing to consider arguments with different PeriodDtype as always different (GH37528)

  • The Period constructor now correctly handles nanoseconds in the value argument (GH34621 and GH17053)

Timedelta

Timezones

Numeric

Conversion

Strings

Interval

Indexing

Missing

MultiIndex

  • Bug in DataFrame.xs() when used with IndexSlice raising TypeError with message "Expected label or tuple of labels" (GH35301)

  • Bug in DataFrame.reset_index() raising ValueError with message "cannot convert float NaN to integer" when the index contains NaT values (GH36541)

  • Bug in DataFrame.combine_first() raising TypeError when used with a MultiIndex containing string and NaN values (GH36562)

  • Bug in MultiIndex.drop() dropping NaN values when a non-existing key was given as input (GH18853)

  • Bug in MultiIndex.drop() dropping more values than expected when the index has duplicates and is not sorted (GH33494)

I/O

Period

Plotting

Styler

  • Bug in Styler.render() where HTML was generated incorrectly because of a formatting error in the rowspan attribute; it now matches the w3 syntax (GH38234)

Groupby/resample/rolling

  • Bug in DataFrameGroupBy.count() and SeriesGroupBy.sum() returning NaN for missing categories when grouped on multiple Categoricals. Now returning 0 (GH35028)

  • Bug in DataFrameGroupBy.apply() that would sometimes throw an erroneous ValueError if the grouping axis had duplicate entries (GH16646)

  • Bug in DataFrame.resample() that would throw a ValueError when resampling from "D" to "24H" over a transition into daylight savings time (DST) (GH35219)

  • Bug when combining methods DataFrame.groupby() with DataFrame.resample() and DataFrame.interpolate() raising a TypeError (GH35325)

  • Bug in DataFrameGroupBy.apply() where a non-nuisance grouping column would be dropped from the output columns if another groupby method was called before .apply (GH34656)

  • Bug when subsetting columns on a DataFrameGroupBy (e.g. df.groupby('a')[['b']]) would reset the attributes axis, dropna, group_keys, level, mutated, sort, and squeeze to their default values (GH9959)

  • Bug in DataFrameGroupBy.tshift() failing to raise ValueError when a frequency cannot be inferred for the index of a group (GH35937)

  • Bug in DataFrame.groupby() not always maintaining the column index name for any, all, bfill, ffill, shift (GH29764)

  • Bug in DataFrameGroupBy.apply() raising error with np.nan group(s) when dropna=False (GH35889)

  • Bug in Rolling.sum() returning wrong values when dtypes were mixed between float and integer and axis=1 (GH20649, GH35596)

  • Bug in Rolling.count() returned np.nan with FixedForwardWindowIndexer as window, min_periods=0 and only missing values in the window (GH35579)

  • Bug where pandas.core.window.Rolling produces incorrect window sizes when using a PeriodIndex (GH34225)

  • Bug in DataFrameGroupBy.ffill() and DataFrameGroupBy.bfill() where a NaN group would return filled values instead of NaN when dropna=True (GH34725)

  • Bug in RollingGroupby.count() where a ValueError was raised when specifying the closed parameter (GH35869)

  • Bug in DataFrameGroupBy.rolling() returning wrong values with partial centered window (GH36040)

  • Bug in DataFrameGroupBy.rolling() returning wrong values with a time-aware window containing NaN; a ValueError is now raised because such windows are not monotonic (GH34617)

  • Bug in Rolling.__iter__() where a ValueError was not raised when min_periods was larger than window (GH37156)

  • Using Rolling.var() instead of Rolling.std() avoids numerical issues for Rolling.corr() when Rolling.var() is still within floating point precision while Rolling.std() is not (GH31286)

  • Bug in DataFrameGroupBy.quantile() and Resampler.quantile() raised TypeError when values were of type Timedelta (GH29485)

  • Bug in Rolling.median() and Rolling.quantile() returned wrong values for BaseIndexer subclasses with non-monotonic starting or ending points for windows (GH37153)

  • Bug in DataFrame.groupby() dropped nan groups from result with dropna=False when grouping over a single column (GH35646, GH35542)

  • Bug in DataFrameGroupBy.head(), DataFrameGroupBy.tail(), SeriesGroupBy.head(), and SeriesGroupBy.tail() would raise when used with axis=1 (GH9772)

  • Bug in DataFrameGroupBy.transform() would raise when used with axis=1 and a transformation kernel (e.g. “shift”) (GH36308)

  • Bug in DataFrameGroupBy.resample() using .agg with sum producing a different result than calling .sum directly (GH33548)

  • Bug in DataFrameGroupBy.apply() dropping values in the NaN group when returning the same axes as the original frame (GH38227)

  • Bug in DataFrameGroupBy.quantile() not handling an array-like q when grouping by columns (GH33795)

  • Bug in DataFrameGroupBy.rank() with datetime64tz or period dtype incorrectly casting results to those dtypes instead of returning float64 dtype (GH38187)

Reshaping

ExtensionArray

Other

Contributors

For the full list of contributors, see /usr/share/doc/contributors_list.txt or https://github.com/pandas-dev/pandas/graphs/contributors