What’s new in 2.1.0 (Aug 30, 2023)#

These are the changes in pandas 2.1.0. See Release notes for a full changelog including other versions of pandas.

Enhancements#

PyArrow will become a required dependency with pandas 3.0#

PyArrow will become a required dependency of pandas starting with pandas 3.0. This decision was made based on PDEP 10.

This will enable more changes that are hugely beneficial to pandas users, including but not limited to:

  • inferring strings as PyArrow backed strings by default enabling a significant reduction of the memory footprint and huge performance improvements.

  • inferring more complex dtypes with PyArrow by default, like Decimal, lists, bytes, structured data and more.

  • Better interoperability with other libraries that depend on Apache Arrow.

We are collecting feedback on this decision on the pandas issue tracker.

Avoid NumPy object dtype for strings by default#

Previously, all strings were stored in columns with NumPy object dtype by default. This release introduces an option future.infer_string that infers all strings as PyArrow backed strings with dtype "string[pyarrow_numpy]" instead. This is a new string dtype implementation that follows NumPy semantics in comparison operations and will return np.nan as the missing value indicator. Setting the option will also infer the dtype "string" as a StringDtype with storage set to "pyarrow_numpy", ignoring the value behind the option mode.string_storage.

This option only works if PyArrow is installed. PyArrow backed strings have a significantly reduced memory footprint and provide a big performance improvement compared to NumPy object (GH 54430).

The option can be enabled with:

pd.options.future.infer_string = True

This behavior will become the default with pandas 3.0.
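
As a quick illustration (a minimal sketch, assuming pandas 2.1 with PyArrow installed), string columns are then inferred as the new dtype instead of NumPy object:

import pandas as pd

pd.options.future.infer_string = True  # requires PyArrow

ser = pd.Series(["pandas", "arrow", "strings"])
ser.dtype  # StringDtype with storage "pyarrow_numpy", not NumPy object dtype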

DataFrame reductions preserve extension dtypes#

In previous versions of pandas, the results of DataFrame reductions (DataFrame.sum(), DataFrame.mean(), etc.) had NumPy dtypes, even when the DataFrames were of extension dtypes. pandas can now keep the dtypes when doing reductions over DataFrame columns with a common dtype (GH 52788).

Old Behavior

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")
In [2]: df.sum()
Out[2]:
a    5
b    9
dtype: int64
In [3]: df = df.astype("int64[pyarrow]")
In [4]: df.sum()
Out[4]:
a    5
b    9
dtype: int64

New Behavior

In [1]: df = pd.DataFrame({"a": [1, 1, 2, 1], "b": [np.nan, 2.0, 3.0, 4.0]}, dtype="Int64")

In [2]: df.sum()
Out[2]: 
a    5
b    9
dtype: Int64

In [3]: df = df.astype("int64[pyarrow]")

In [4]: df.sum()
Out[4]:
a    5
b    9
dtype: int64[pyarrow]

Notice that the dtype is now a masked dtype and PyArrow dtype, respectively, while previously it was a NumPy integer dtype.

To allow DataFrame reductions to preserve extension dtypes, ExtensionArray._reduce() has gained a new keyword parameter keepdims. Calling ExtensionArray._reduce() with keepdims=True should return an array of length 1 along the reduction axis. In order to maintain backward compatibility, the parameter is not required, but it will become required in the future. If the parameter is not found in the signature, DataFrame reductions cannot preserve extension dtypes. Also, if the parameter is not found, a FutureWarning will be emitted and type checkers like mypy may complain about the signature not being compatible with ExtensionArray._reduce().
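
For authors of third-party ExtensionArray subclasses, a minimal sketch of the updated method might look like the following (MyArray and its _scalar_reduce helper are hypothetical; only the keepdims keyword is part of the new interface):

from pandas.api.extensions import ExtensionArray

class MyArray(ExtensionArray):
    # ... the other required ExtensionArray methods are elided ...

    def _reduce(self, name, *, skipna=True, keepdims=False, **kwargs):
        # Compute the scalar reduction as before (hypothetical helper).
        result = self._scalar_reduce(name, skipna=skipna, **kwargs)
        if keepdims:
            # Return a length-1 array so DataFrame reductions can
            # preserve this extension dtype.
            return type(self)._from_sequence([result], dtype=self.dtype)
        return result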

Copy-on-Write improvements#

  • Series.transform() not respecting Copy-on-Write when func modifies Series inplace (GH 53747)

  • Calling Index.values() will now return a read-only NumPy array (GH 53704)

  • Setting a Series into a DataFrame now creates a lazy instead of a deep copy (GH 53142)

  • The DataFrame constructor, when constructing a DataFrame from a dictionary of Index objects and specifying copy=False, will now use a lazy copy of those Index objects for the columns of the DataFrame (GH 52947)

  • A shallow copy of a Series or DataFrame (df.copy(deep=False)) will now also return a shallow copy of the rows/columns Index objects instead of only a shallow copy of the data, i.e. the index of the result is no longer identical (df.copy(deep=False).index is df.index is no longer True) (GH 53721)

  • DataFrame.head() and DataFrame.tail() will now return deep copies (GH 54011)

  • Add lazy copy mechanism to DataFrame.eval() (GH 53746)

  • Trying to operate inplace on a temporary column selection (for example, df["a"].fillna(100, inplace=True)) will now always raise a warning when Copy-on-Write is enabled. In this mode, operating inplace like this will never work, since the selection behaves as a temporary copy. This holds true for the following methods (see the sketch after this list):

    • DataFrame.update / Series.update

    • DataFrame.fillna / Series.fillna

    • DataFrame.replace / Series.replace

    • DataFrame.clip / Series.clip

    • DataFrame.where / Series.where

    • DataFrame.mask / Series.mask

    • DataFrame.interpolate / Series.interpolate

    • DataFrame.ffill / Series.ffill

    • DataFrame.bfill / Series.bfill
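
A minimal sketch of the warning pattern and its replacement, assuming Copy-on-Write is enabled via the mode.copy_on_write option:

import pandas as pd

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1.0, None, 3.0]})

# df["a"] is a temporary copy under Copy-on-Write, so this warns
# and never updates df:
# df["a"].fillna(100, inplace=True)

# Assign the result back to the DataFrame instead:
df["a"] = df["a"].fillna(100)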

New DataFrame.map() method and support for ExtensionArrays#

DataFrame.map() has been added and DataFrame.applymap() has been deprecated. DataFrame.map() has the same functionality as DataFrame.applymap(), but the new name better communicates that this is the DataFrame version of Series.map() (GH 52353).

When given a callable, Series.map() applies the callable to all elements of the Series. Similarly, DataFrame.map() applies the callable to all elements of the DataFrame, while Index.map() applies the callable to all elements of the Index.

Frequently, it is not desirable to apply the callable to NaN-like values of the array; to avoid this, the map method can be called with na_action="ignore", i.e. ser.map(func, na_action="ignore"). However, na_action="ignore" was not implemented for many ExtensionArray and Index types, and it did not work correctly for any ExtensionArray subclass except the nullable numeric ones (i.e. with dtype Int64 etc.).

na_action="ignore" now works for all array types (GH 52219, GH 51645, GH 51809, GH 51936, GH 52033; GH 52096).

Previous behavior:

In [1]: ser = pd.Series(["a", "b", np.nan], dtype="category")
In [2]: ser.map(str.upper, na_action="ignore")
NotImplementedError
In [3]: df = pd.DataFrame(ser)
In [4]: df.applymap(str.upper, na_action="ignore")  # worked for DataFrame
     0
0    A
1    B
2  NaN
In [5]: idx = pd.Index(ser)
In [6]: idx.map(str.upper, na_action="ignore")
TypeError: CategoricalIndex.map() got an unexpected keyword argument 'na_action'

New behavior:

In [5]: ser = pd.Series(["a", "b", np.nan], dtype="category")

In [6]: ser.map(str.upper, na_action="ignore")
Out[6]: 
0      A
1      B
2    NaN
dtype: category
Categories (2, object): ['A', 'B']

In [7]: df = pd.DataFrame(ser)

In [8]: df.map(str.upper, na_action="ignore")
Out[8]: 
     0
0    A
1    B
2  NaN

In [9]: idx = pd.Index(ser)

In [10]: idx.map(str.upper, na_action="ignore")
Out[10]: CategoricalIndex(['A', 'B', nan], categories=['A', 'B'], ordered=False, dtype='category')

Also, note that Categorical.map() has implicitly had na_action="ignore" as its default. This has been deprecated and the default for Categorical.map() will change to na_action=None, consistent with all the other array types.
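
To keep the old behavior and silence the deprecation, pass na_action="ignore" explicitly (a minimal sketch):

import pandas as pd

cat = pd.Categorical(["a", "b", None])

# Relying on the deprecated implicit default is deprecated; be explicit:
cat.map(str.upper, na_action="ignore")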

New implementation of DataFrame.stack()#

pandas has reimplemented DataFrame.stack(). To use the new implementation, pass the argument future_stack=True. This will become the only option in pandas 3.0.

The previous implementation had two main behavioral downsides.

  1. The previous implementation would unnecessarily introduce NA values into the result. The user could have NA values automatically removed by passing dropna=True (the default), but doing this could also remove NA values from the result that existed in the input. See the examples below.

  2. The previous implementation with sort=True (the default) would sometimes sort part of the resulting index, and sometimes not. If the input’s columns are not a MultiIndex, then the resulting index would never be sorted. If the columns are a MultiIndex, then in most cases the level(s) in the resulting index that come from stacking the column level(s) would be sorted. In rare cases such level(s) would be sorted in a non-standard order, depending on how the columns were created.

The new implementation (future_stack=True) will no longer unnecessarily introduce NA values when stacking multiple levels and will never sort. As such, the arguments dropna and sort are not utilized and must remain unspecified when using future_stack=True. These arguments will be removed in the next major release.

In [11]: columns = pd.MultiIndex.from_tuples([("B", "d"), ("A", "c")])

In [12]: df = pd.DataFrame([[0, 2], [1, 3]], index=["z", "y"], columns=columns)

In [13]: df
Out[13]: 
   B  A
   d  c
z  0  2
y  1  3

In the previous version (future_stack=False), the default of dropna=True would remove unnecessarily introduced NA values but still coerce the dtype to float64 in the process. In the new version, no NAs are introduced and so there is no coercion of the dtype.

In [14]: df.stack([0, 1], future_stack=False, dropna=True)
Out[14]: 
z  A  c    2.0
   B  d    0.0
y  A  c    3.0
   B  d    1.0
dtype: float64

In [15]: df.stack([0, 1], future_stack=True)
Out[15]: 
z  B  d    0
   A  c    2
y  B  d    1
   A  c    3
dtype: int64

If the input contains NA values, the previous version would drop those as well with dropna=True or introduce new NA values with dropna=False. The new version persists all values from the input.

In [16]: df = pd.DataFrame([[0, 2], [np.nan, np.nan]], columns=columns)

In [17]: df
Out[17]: 
     B    A
     d    c
0  0.0  2.0
1  NaN  NaN

In [18]: df.stack([0, 1], future_stack=False, dropna=True)
Out[18]: 
0  A  c    2.0
   B  d    0.0
dtype: float64

In [19]: df.stack([0, 1], future_stack=False, dropna=False)
Out[19]: 
0  A  d    NaN
      c    2.0
   B  d    0.0
      c    NaN
1  A  d    NaN
      c    NaN
   B  d    NaN
      c    NaN
dtype: float64

In [20]: df.stack([0, 1], future_stack=True)
Out[20]: 
0  B  d    0.0
   A  c    2.0
1  B  d    NaN
   A  c    NaN
dtype: float64

Other enhancements#

Backwards incompatible API changes#

Increased minimum version for Python#

pandas 2.1.0 supports Python 3.9 and higher.

Increased minimum versions for dependencies#

Some minimum supported versions of dependencies were updated. If installed, we now require:

Package                 Minimum Version    Required    Changed
----------------------  -----------------  ----------  ---------
numpy                   1.22.4             X           X
mypy (dev)              1.4.1                          X
beautifulsoup4          4.11.1                         X
bottleneck              1.3.4                          X
dataframe-api-compat    0.1.7                          X
fastparquet             0.8.1                          X
fsspec                  2022.05.0                      X
hypothesis              6.46.1                         X
gcsfs                   2022.05.0                      X
jinja2                  3.1.2                          X
lxml                    4.8.0                          X
numba                   0.55.2                         X
numexpr                 2.8.0                          X
openpyxl                3.0.10                         X
pandas-gbq              0.17.5                         X
psycopg2                2.9.3                          X
pyreadstat              1.1.5                          X
pyqt5                   5.15.6                         X
pytables                3.7.0                          X
pytest                  7.3.2                          X
python-snappy           0.6.1                          X
pyxlsb                  1.0.9                          X
s3fs                    2022.05.0                      X
scipy                   1.8.1                          X
sqlalchemy              1.4.36                         X
tabulate                0.8.10                         X
xarray                  2022.03.0                      X
xlsxwriter              3.0.3                          X
zstandard               0.17.0                         X

For optional libraries the general recommendation is to use the latest version.

See Dependencies and Optional dependencies for more.

Other API changes#

  • arrays.PandasArray has been renamed NumpyExtensionArray and the attached dtype name changed from PandasDtype to NumpyEADtype; importing PandasArray still works until the next major version (GH 53694)
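
For example (a minimal sketch; it assumes both names are importable from pandas.arrays during the deprecation window):

import numpy as np
from pandas.arrays import NumpyExtensionArray  # new name

arr = NumpyExtensionArray(np.array([1, 2, 3]))

from pandas.arrays import PandasArray  # old name, still works for now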

Deprecations#

Deprecated silent upcasting in setitem-like Series operations#

Setitem-like operations on Series (or DataFrame columns) which silently upcast the dtype are deprecated and show a warning, as described in PDEP-6 (https://pandas.pydata.org/pdeps/0006-ban-upcasting.html). Examples of affected operations are:

  • ser.fillna('foo', inplace=True)

  • ser.where(ser.isna(), 'foo', inplace=True)

  • ser.iloc[indexer] = 'foo'

  • ser.loc[indexer] = 'foo'

  • df.iloc[indexer, 0] = 'foo'

  • df.loc[indexer, 'a'] = 'foo'

  • ser[indexer] = 'foo'

where ser is a Series, df is a DataFrame, and indexer could be a slice, a mask, a single value, a list or array of values, or any other allowed indexer.

In a future version, these will raise an error and you should cast to a common dtype first.

Previous behavior:

In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: ser[0] = 'not an int64'

In [4]: ser
Out[4]:
0    not an int64
1               2
2               3
dtype: object

New behavior:

In [1]: ser = pd.Series([1, 2, 3])

In [2]: ser
Out[2]:
0    1
1    2
2    3
dtype: int64

In [3]: ser[0] = 'not an int64'
FutureWarning:
  Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas.
  Value 'not an int64' has dtype incompatible with int64, please explicitly cast to a compatible dtype first.

In [4]: ser
Out[4]:
0    not an int64
1               2
2               3
dtype: object

To retain the current behaviour, in the case above you could cast ser to object dtype first:

In [21]: ser = pd.Series([1, 2, 3])

In [22]: ser = ser.astype('object')

In [23]: ser[0] = 'not an int64'

In [24]: ser
Out[24]: 
0    not an int64
1               2
2               3
dtype: object

Depending on the use-case, it might be more appropriate to cast to a different dtype. In the following, for example, we cast to float64:

In [25]: ser = pd.Series([1, 2, 3])

In [26]: ser = ser.astype('float64')

In [27]: ser[0] = 1.1

In [28]: ser
Out[28]: 
0    1.1
1    2.0
2    3.0
dtype: float64

For further reading, please see https://pandas.pydata.org/pdeps/0006-ban-upcasting.html.

Deprecated parsing datetimes with mixed time zones#

Parsing datetimes with mixed time zones is deprecated and shows a warning unless the user passes utc=True to to_datetime() (GH 50887)

Previous behavior:

In [7]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

In [8]:  pd.to_datetime(data, utc=False)
Out[8]:
Index([2020-01-01 00:00:00+06:00, 2020-01-01 00:00:00+01:00], dtype='object')

New behavior:

In [9]: pd.to_datetime(data, utc=False)
FutureWarning:
  In a future version of pandas, parsing datetimes with mixed time zones will raise
  an error unless `utc=True`. Please specify `utc=True` to opt in to the new behaviour
  and silence this warning. To create a `Series` with mixed offsets and `object` dtype,
  please use `apply` and `datetime.datetime.strptime`.
Index([2020-01-01 00:00:00+06:00, 2020-01-01 00:00:00+01:00], dtype='object')

In order to silence this warning and avoid an error in a future version of pandas, please specify utc=True:

In [29]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

In [30]: pd.to_datetime(data, utc=True)
Out[30]: DatetimeIndex(['2019-12-31 18:00:00+00:00', '2019-12-31 23:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

To create a Series with mixed offsets and object dtype, please use apply and datetime.datetime.strptime:

In [31]: import datetime as dt

In [32]: data = ["2020-01-01 00:00:00+06:00", "2020-01-01 00:00:00+01:00"]

In [33]: pd.Series(data).apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S%z'))
Out[33]: 
0    2020-01-01 00:00:00+06:00
1    2020-01-01 00:00:00+01:00
dtype: object

Other Deprecations#

Performance improvements#

Bug fixes#

Categorical#

Datetimelike#

  • DatetimeIndex.map() with na_action="ignore" now works as expected (GH 51644)

  • DatetimeIndex.slice_indexer() now raises KeyError for non-monotonic indexes if either of the slice bounds is not in the index; this behaviour was previously deprecated but inconsistently handled (GH 53983)

  • Bug in DateOffset which had inconsistent behavior when multiplying a DateOffset object by a constant (GH 47953)

  • Bug in date_range() when freq was a DateOffset with nanoseconds (GH 46877)

  • Bug in to_datetime() converting Series or DataFrame containing arrays.ArrowExtensionArray of PyArrow timestamps to numpy datetimes (GH 52545)

  • Bug in DatetimeArray.map() and DatetimeIndex.map(), where the supplied callable operated array-wise instead of element-wise (GH 51977)

  • Bug in DataFrame.to_sql() raising ValueError for PyArrow-backed date-like dtypes (GH 53854)

  • Bug in Timestamp.date(), Timestamp.isocalendar(), Timestamp.timetuple(), and Timestamp.toordinal() were returning incorrect results for inputs outside those supported by the Python standard library’s datetime module (GH 53668)

  • Bug in Timestamp.round() with values close to the implementation bounds returning incorrect results instead of raising OutOfBoundsDatetime (GH 51494)

  • Bug in constructing a Series or DataFrame from a datetime or timedelta scalar always inferring nanosecond resolution instead of inferring from the input (GH 52212)

  • Bug in constructing a Timestamp from a string representing a time without a date inferring an incorrect unit (GH 54097)

  • Bug in constructing a Timestamp with ts_input=pd.NA raising TypeError (GH 45481)

  • Bug in parsing datetime strings with weekday but no day e.g. “2023 Sept Thu” incorrectly raising AttributeError instead of ValueError (GH 52659)

  • Bug in the repr for Series when dtype is a timezone aware datetime with non-nanosecond resolution raising OutOfBoundsDatetime (GH 54623)

Timedelta#

  • Bug in TimedeltaIndex division or multiplication leading to .freq of “0 Days” instead of None (GH 51575)

  • Bug in Timedelta with NumPy timedelta64 objects not properly raising ValueError (GH 52806)

  • Bug in to_timedelta() converting Series or DataFrame containing ArrowDtype of pyarrow.duration to NumPy timedelta64 (GH 54298)

  • Bug in Timedelta.__hash__(), raising an OutOfBoundsTimedelta on certain large values of second resolution (GH 54037)

  • Bug in Timedelta.round() with values close to the implementation bounds returning incorrect results instead of raising OutOfBoundsTimedelta (GH 51494)

  • Bug in TimedeltaIndex.map() with na_action="ignore" (GH 51644)

  • Bug in arrays.TimedeltaArray.map() and TimedeltaIndex.map(), where the supplied callable operated array-wise instead of element-wise (GH 51977)

Timezones#

  • Bug in infer_freq() that raises TypeError for Series of timezone-aware timestamps (GH 52456)

  • Bug in DatetimeTZDtype.base() that always returns a NumPy dtype with nanosecond resolution (GH 52705)

Numeric#

Conversion#

Strings#

  • Bug in Series.str() that did not raise a TypeError when iterated (GH 54173)

  • Bug in the repr for DataFrame with string-dtype columns (GH 54797)

Interval#

Indexing#

  • Bug in DataFrame.__setitem__() losing dtype when setting a DataFrame into duplicated columns (GH 53143)

  • Bug in DataFrame.__setitem__() with a boolean mask and DataFrame.putmask() with mixed non-numeric dtypes and a value other than NaN incorrectly raising TypeError (GH 53291)

  • Bug in DataFrame.iloc() when using nan as the only element (GH 52234)

  • Bug in Series.loc() casting Series to np.ndarray when assigning Series at predefined index of object dtype Series (GH 48933)

Missing#

MultiIndex#

I/O#

  • Bug in read_csv() not processing empty strings as a null value, with engine="pyarrow" (GH 52087)

  • Bug in read_csv() with engine="pyarrow" returning object dtype columns instead of float64 dtype columns for columns that are all null (GH 52087)

Period#

  • Bug in PeriodDtype constructor failing to raise TypeError when no argument is passed or when None is passed (GH 27388)

  • Bug in PeriodDtype constructor incorrectly returning the same normalize for different DateOffset freq inputs (GH 24121)

  • Bug in PeriodDtype constructor raising ValueError instead of TypeError when an invalid type is passed (GH 51790)

  • Bug in PeriodDtype where the object could be kept alive when deleted (GH 54184)

  • Bug in Period.now() not accepting the freq parameter as a keyword argument (GH 53369)

  • Bug in PeriodIndex.map() with na_action="ignore" (GH 51644)

  • Bug in arrays.PeriodArray.map() and PeriodIndex.map(), where the supplied callable operated array-wise instead of element-wise (GH 51977)

  • Bug in incorrectly allowing construction of Period or PeriodDtype with CustomBusinessDay freq; use BusinessDay instead (GH 52534)

Plotting#

Groupby/resample/rolling#

Reshaping#

Sparse#

  • Bug in SparseDtype constructor failing to raise TypeError when given an incompatible dtype for its subtype, which must be a NumPy dtype (GH 53160)

  • Bug in arrays.SparseArray.map() allowed the fill value to be included in the sparse values (GH 52095)

ExtensionArray#

  • Bug in the ArrowStringArray constructor raising ValueError with dictionary types of strings (GH 54074)

  • Bug in DataFrame constructor not copying Series with extension dtype when given in dict (GH 53744)

  • Bug in ArrowExtensionArray converting pandas non-nanosecond temporal objects from non-zero values to zero values (GH 53171)

  • Bug in Series.quantile() for PyArrow temporal types raising ArrowInvalid (GH 52678)

  • Bug in Series.rank() returning wrong order for small values with Float64 dtype (GH 52471)

  • Bug in Series.unique() for boolean ArrowDtype with NA values (GH 54667)

  • Bug in __iter__() and __getitem__() returning python datetime and timedelta objects for non-nano dtypes (GH 53326)

  • Bug in factorize() returning incorrect uniques for a pyarrow.dictionary type pyarrow.chunked_array with more than one chunk (GH 54844)

  • Bug when passing an ExtensionArray subclass to dtype keywords. This will now raise a UserWarning to encourage passing an instance instead (GH 31356, GH 54592)

  • Bug where the DataFrame repr would not work when a column had an ArrowDtype with a pyarrow.ExtensionDtype (GH 54063)

  • Bug where the __from_arrow__ method of masked ExtensionDtypes (e.g. Float64Dtype, BooleanDtype) would not accept PyArrow arrays of type pyarrow.null() (GH 52223)

Styler#

  • Bug in Styler._copy() calling overridden methods in subclasses of Styler (GH 52728)

Metadata#

Other#

Contributors#

For the full list of contributors to this release, see https://github.com/pandas-dev/pandas/graphs/contributors