What’s new in 1.2.0 (December 26, 2020)
These are the changes in pandas 1.2.0. See Release notes for a full changelog including other versions of pandas.
Warning
The xlwt package for writing old-style .xls Excel files is no longer maintained. The xlrd package is now only for reading old-style .xls files.
Previously, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. If openpyxl is installed, many of these cases will now default to using the openpyxl engine. See the read_excel() documentation for more details.
Thus, it is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using xlrd to read .xlsx files. This is no longer supported; switch to using openpyxl instead.
Attempting to use the xlwt engine will raise a FutureWarning unless the option io.excel.xls.writer is set to "xlwt". While this option is now deprecated and will also raise a FutureWarning, it can be globally set and the warning suppressed. Users are recommended to write .xlsx files using the openpyxl engine instead.
Enhancements
Optionally disallow duplicate labels
Series and DataFrame can now be created with the allows_duplicate_labels=False flag to control whether the index or columns can contain duplicate labels (GH28394). This can be used to prevent accidental introduction of duplicate labels, which can affect downstream operations.
By default, duplicates continue to be allowed.
In [1]: pd.Series([1, 2], index=['a', 'a'])
Out[1]:
a 1
a 2
Length: 2, dtype: int64
In [2]: pd.Series([1, 2], index=['a', 'a']).set_flags(allows_duplicate_labels=False)
...
DuplicateLabelError: Index has duplicates.
positions
label
a [0, 1]
pandas will propagate the allows_duplicate_labels property through many operations.
In [3]: a = (
...: pd.Series([1, 2], index=['a', 'b'])
...: .set_flags(allows_duplicate_labels=False)
...: )
In [4]: a
Out[4]:
a 1
b 2
Length: 2, dtype: int64
# An operation introducing duplicates
In [5]: a.reindex(['a', 'b', 'a'])
...
DuplicateLabelError: Index has duplicates.
positions
label
a [0, 2]
[1 rows x 1 columns]
Warning
This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.
See Duplicate Labels for more.
The allows_duplicate_labels flag is stored in the new DataFrame.flags attribute. This stores global attributes that apply to the pandas object. This differs from DataFrame.attrs, which stores information that applies to the dataset.
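The difference between flags and attrs can be sketched as follows. This is a minimal illustration, not part of the original release notes; the frame contents and the "source" metadata key are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])

# attrs holds free-form information about the dataset (hypothetical key).
df.attrs["source"] = "example-dataset"

# set_flags returns a new object; the original keeps the default flag.
strict = df.set_flags(allows_duplicate_labels=False)
default_flag = df.flags.allows_duplicate_labels
strict_flag = strict.flags.allows_duplicate_labels

# An operation that would introduce duplicate labels is rejected
# on the flagged object.
try:
    strict.reindex(["a", "b", "a"])
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True
```

Because set_flags returns a new object, the flag can be applied at the end of a loading pipeline without touching the unflagged original.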
Passing arguments to fsspec backends
Many read/write functions have acquired the storage_options optional argument, to pass a dictionary of parameters to the storage backend. This allows, for example, passing credentials to S3 and GCS storage. The details of what parameters can be passed to which backends can be found in the documentation of the individual storage backends (detailed from the fsspec docs for builtin implementations and linked to external ones). See Section Reading/writing remote files.
GH35655 added fsspec support (including storage_options) for reading Excel files.
Support for binary file handles in to_csv
to_csv() supports file handles in binary mode (GH19827 and GH35058) with encoding (GH13068 and GH23854) and compression (GH22555). If pandas does not automatically detect whether the file handle is opened in binary or text mode, it is necessary to provide mode="wb".
For example:
In [1]: import io
In [2]: data = pd.DataFrame([0, 1, 2])
In [3]: buffer = io.BytesIO()
In [4]: data.to_csv(buffer, encoding="utf-8", compression="gzip")
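As a rough check that the bytes written to the buffer really are gzip-compressed CSV, the buffer can be read back. This sketch extends the example above under a few assumptions: the column name "value" and index=False are illustrative choices, not taken from the release notes.

```python
import io

import pandas as pd

data = pd.DataFrame({"value": [0, 1, 2]})

# Write gzip-compressed, UTF-8 encoded CSV into an in-memory binary handle.
buffer = io.BytesIO()
data.to_csv(buffer, index=False, encoding="utf-8", compression="gzip")

raw = buffer.getvalue()
# gzip streams start with the two magic bytes 0x1f 0x8b.
is_gzip = raw[:2] == b"\x1f\x8b"

# Read the same bytes back through a fresh binary handle.
restored = pd.read_csv(io.BytesIO(raw), compression="gzip")
```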
Change in default floating precision for read_csv and read_table
For the C parsing engine, the methods read_csv() and read_table() previously defaulted to a parser that could read floating point numbers slightly incorrectly with respect to the last bit in precision. The option float_precision="high" has always been available to avoid this issue. Beginning with this version, the default is now to use the more accurate parser by making float_precision=None correspond to the high precision parser, and the new option float_precision="legacy" to use the legacy parser. The change to using the higher precision parser by default should have no impact on performance. (GH17154)
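A short sketch of selecting the parser explicitly through the float_precision keyword of read_csv(). The sample text is invented for illustration, and no claim is made about which inputs actually differ in the last bit between the two parsers.

```python
import io

import pandas as pd

text = "x\n0.1\n1e-5\n2.5\n"

# Default: the high-precision parser is used when float_precision is None.
default = pd.read_csv(io.StringIO(text))

# Explicitly request the high-precision parser.
high = pd.read_csv(io.StringIO(text), float_precision="high")

# Opt back in to the previous behaviour.
legacy = pd.read_csv(io.StringIO(text), float_precision="legacy")
```

Code that depends on bit-for-bit reproduction of old results can pass float_precision="legacy"; otherwise no change should be needed.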
Experimental nullable data types for float data
We’ve added Float32Dtype / Float64Dtype and FloatingArray. These are extension data types dedicated to floating point data that can hold the pd.NA missing value indicator (GH32265, GH34307).
While the default float data type already supports missing values using np.nan, these new data types use pd.NA (and its corresponding behavior) as the missing value indicator, in line with the already existing nullable integer and boolean data types.
One example where the behavior of np.nan and pd.NA is different is comparison operations:
# the default NumPy float64 dtype
In [11]: s1 = pd.Series([1.5, None])
In [12]: s1
Out[12]:
0 1.5
1 NaN
dtype: float64
In [13]: s1 > 1
Out[13]:
0 True
1 False
dtype: bool
# the new nullable float64 dtype
In [14]: s2 = pd.Series([1.5, None], dtype="Float64")
In [15]: s2
Out[15]:
0 1.5
1 <NA>
dtype: Float64
In [16]: s2 > 1
Out[16]:
0 True
1 <NA>
dtype: boolean
See the Experimental NA scalar to denote missing values doc section for more details on the behavior when using the pd.NA missing value indicator.
As shown above, the dtype can be specified using the “Float64” or “Float32” string (capitalized to distinguish it from the default “float64” data type). Alternatively, you can also use the dtype object:
In [17]: pd.Series([1.5, None], dtype=pd.Float32Dtype())
Out[17]:
0 1.5
1 <NA>
dtype: Float32
Operations with the existing integer or boolean nullable data types that give float results will now also use the nullable floating data types (GH38178).
Warning
Experimental: the new floating data types are currently experimental, and their behavior or API may still change without warning. Especially the behavior regarding NaN (distinct from NA missing values) is subject to change.
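The note above about integer operations producing nullable floats can be illustrated with a small sketch. The variable names are illustrative; the point is only that dividing a nullable integer Series yields the nullable Float64 dtype, and that comparisons keep the missing value as pd.NA.

```python
import pandas as pd

counts = pd.Series([1, 2, None], dtype="Int64")

# Division of nullable integers gives a nullable float result.
halves = counts / 2

# Comparisons propagate the missing value instead of returning False.
above = halves > 0.6

# Decide explicitly how missing comparisons should count.
filled = above.fillna(False)
```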
Index/column name preservation when aggregating
When aggregating using concat() or the DataFrame constructor, pandas will now attempt to preserve index and column names whenever possible (GH35847). In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. Here is an example where the index name is preserved:
In [18]: idx = pd.Index(range(5), name='abc')
In [19]: ser = pd.Series(range(5, 10), index=idx)
In [20]: pd.concat({'x': ser[1:], 'y': ser[:-1]}, axis=1)
Out[20]:
x y
abc
1 6.0 6.0
2 7.0 7.0
3 8.0 8.0
4 9.0 NaN
0 NaN 5.0
The same is true for MultiIndex, but the logic is applied separately on a level-by-level basis.
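The two cases described in this section (names agree, names disagree) can be contrasted in a small sketch. The Series and index names here are invented for illustration.

```python
import pandas as pd

a = pd.Series([1, 2], index=pd.Index([0, 1], name="abc"))
b = pd.Series([3, 4], index=pd.Index([0, 1], name="abc"))
c = pd.Series([5, 6], index=pd.Index([0, 1], name="xyz"))

# All inputs share the index name, so the result keeps it.
same = pd.concat({"a": a, "b": b}, axis=1)

# The input names disagree, so the result index is unnamed.
mixed = pd.concat({"a": a, "c": c}, axis=1)
```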
GroupBy supports EWM operations directly
DataFrameGroupBy now supports exponentially weighted window operations directly (GH16037).
In [21]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'], 'B': range(4)})
In [22]: df
Out[22]:
A B
0 a 0
1 b 1
2 a 2
3 b 3
In [23]: df.groupby('A').ewm(com=1.0).mean()
Out[23]:
B
A
a 0 0.000000
2 1.333333
b 1 1.000000
3 2.333333
Additionally, mean supports execution via Numba with the engine and engine_kwargs arguments. Numba must be installed as an optional dependency to use this feature.
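One way to read the grouped result is that each group is smoothed independently. The following sketch compares the grouped call with a manual per-group computation; it uses the default engine only, so Numba is not required, and the helper variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a", "b"], "B": range(4)})

# Exponentially weighted mean computed within each group of A.
grouped = df.groupby("A").ewm(com=1.0).mean()

# Values for group "a", taken from the grouped result.
group_a = grouped.xs("a", level="A")["B"]

# The same values computed by selecting the group by hand.
manual_a = df.loc[df["A"] == "a", "B"].ewm(com=1.0).mean()
```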
Other enhancements
- Added day_of_week (compatibility alias dayofweek) property to Timestamp, DatetimeIndex, Period, PeriodIndex (GH9605)
- Added day_of_year (compatibility alias dayofyear) property to Timestamp, DatetimeIndex, Period, PeriodIndex (GH9605)
- Added set_flags() for setting table-wide flags on a Series or DataFrame (GH28394)
- DataFrame.applymap() now supports na_action (GH23803)
- Index with object dtype supports division and multiplication (GH34160)
- io.sql.get_schema() now supports a schema keyword argument that will add a schema into the create table statement (GH28486)
- DataFrame.explode() and Series.explode() now support exploding of sets (GH35614)
- DataFrame.hist() now supports time series (datetime) data (GH32590)
- Styler.set_table_styles() now allows the direct styling of rows and columns and can be chained (GH35607)
- Styler now allows direct CSS class name addition to individual data cells (GH36159)
- Rolling.mean() and Rolling.sum() use Kahan summation to calculate the mean to avoid numerical problems (GH10319, GH11645, GH13254, GH32761, GH36031)
- DatetimeIndex.searchsorted(), TimedeltaIndex.searchsorted(), PeriodIndex.searchsorted(), and Series.searchsorted() with datetime-like dtypes will now try to cast string arguments (list-like and scalar) to the matching datetime-like type (GH36346)
- Added methods IntegerArray.prod(), IntegerArray.min(), and IntegerArray.max() (GH33790)
- Calling a NumPy ufunc on a DataFrame with extension types now preserves the extension types when possible (GH23743)
- Calling a binary-input NumPy ufunc on multiple DataFrame objects now aligns, matching the behavior of binary operations and ufuncs on Series (GH23743). This change has been reverted in pandas 1.2.1, and the behaviour to not align DataFrames is deprecated instead, see the 1.2.1 release notes.
- Where possible RangeIndex.difference() and RangeIndex.symmetric_difference() will return RangeIndex instead of Int64Index (GH36564)
- DataFrame.to_parquet() now supports MultiIndex for columns in parquet format (GH34777)
- read_parquet() gained a use_nullable_dtypes=True option to use nullable dtypes that use pd.NA as missing value indicator where possible for the resulting DataFrame (default is False, and only applicable for engine="pyarrow") (GH31242)
- Added Rolling.sem() and Expanding.sem() to compute the standard error of the mean (GH26476)
- Rolling.var() and Rolling.std() use Kahan summation and Welford’s Method to avoid numerical issues (GH37051)
- DataFrame.corr() and DataFrame.cov() use Welford’s Method to avoid numerical issues (GH37448)
- DataFrame.plot() now recognizes xlabel and ylabel arguments for plots of type scatter and hexbin (GH37001)
- DataFrame.to_parquet() now returns a bytes object when no path argument is passed (GH37105)
- Rolling now supports the closed argument for fixed windows (GH34315)
- DatetimeIndex and Series with datetime64 or datetime64tz dtypes now support std (GH37436)
- Window now supports all Scipy window types in win_type with flexible keyword argument support (GH34556)
- testing.assert_index_equal() now has a check_order parameter that allows indexes to be checked in an order-insensitive manner (GH37478)
- read_csv() supports memory-mapping for compressed files (GH37621)
- Add support for min_count keyword for DataFrame.groupby() and DataFrame.resample() for functions min, max, first and last (GH37821, GH37768)
- Improve error reporting for DataFrame.merge() when invalid merge column definitions were given (GH16228)
- Improve numerical stability for Rolling.skew(), Rolling.kurt(), Expanding.skew() and Expanding.kurt() through implementation of Kahan summation (GH6929)
- Improved error reporting for subsetting columns of a DataFrameGroupBy with axis=1 (GH37725)
- Implement method cross for DataFrame.merge() and DataFrame.join() (GH5401)
- When read_csv(), read_sas() and read_json() are called with chunksize/iterator they can be used in a with statement as they return context-managers (GH38225)
- Augmented the list of named colors available for styling Excel exports, enabling all of CSS4 colors (GH38247)
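Three of the enhancements listed above are easy to try in a few lines: the cross merge, the chunked reader used as a context manager, and the day_of_week / day_of_year properties. This is an illustrative sketch; the frames and CSV text are invented for the example.

```python
import io

import pandas as pd

# Cross merge: every row of the left frame paired with every row of the right.
sizes = pd.DataFrame({"size": ["S", "M"]})
colors = pd.DataFrame({"color": ["red", "green", "blue"]})
combos = sizes.merge(colors, how="cross")

# A chunked reader can be used in a with statement, which closes it on exit.
csv_text = "a,b\n1,2\n3,4\n5,6\n"
with pd.read_csv(io.StringIO(csv_text), chunksize=2) as reader:
    chunk_lengths = [len(chunk) for chunk in reader]

# New snake_case date properties (Monday is 0 for day_of_week).
release_day = pd.Timestamp("2020-12-26")
weekday = release_day.day_of_week
ordinal = release_day.day_of_year
```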
Notable bug fixes
These are bug fixes that might have notable behavior changes.
Consistency of DataFrame Reductions
DataFrame.any() and DataFrame.all() with bool_only=True now determine whether to exclude object-dtype columns on a column-by-column basis, instead of checking if all object-dtype columns can be considered boolean.
This prevents pathological behavior where applying the reduction on a subset of columns could result in a larger Series result. See (GH37799).
In [24]: df = pd.DataFrame({"A": ["foo", "bar"], "B": [True, False]}, dtype=object)
In [25]: df["C"] = pd.Series([True, True])
Previous behavior:
In [5]: df.all(bool_only=True)
Out[5]:
C True
dtype: bool
In [6]: df[["B", "C"]].all(bool_only=True)
Out[6]:
B False
C True
dtype: bool
New behavior:
In [26]: df.all(bool_only=True)
Out[26]:
B False
C True
dtype: bool
In [27]: df[["B", "C"]].all(bool_only=True)
Out[27]:
B False
C True
dtype: bool
Other DataFrame reductions with numeric_only=None will also avoid this pathological behavior (GH37827):
In [28]: df = pd.DataFrame({"A": [0, 1, 2], "B": ["a", "b", "c"]}, dtype=object)
Previous behavior:
In [3]: df.mean()
Out[3]: Series([], dtype: float64)
In [4]: df[["A"]].mean()
Out[4]:
A 1.0
dtype: float64
New behavior:
In [29]: df.mean()
Out[29]:
A 1.0
dtype: float64
In [30]: df[["A"]].mean()
Out[30]:
A 1.0
dtype: float64
Moreover, DataFrame reductions with numeric_only=None will now be consistent with their Series counterparts. In particular, for reductions where the Series method raises TypeError, the DataFrame reduction will now consider that column non-numeric instead of casting to a NumPy array which may have different semantics (GH36076, GH28949, GH21020).
In [31]: ser = pd.Series([0, 1], dtype="category", name="A")
In [32]: df = ser.to_frame()
Previous behavior:
In [5]: df.any()
Out[5]:
A True
dtype: bool
New behavior:
In [33]: df.any()
Out[33]: Series([], dtype: bool)
Increased minimum version for Python
pandas 1.2.0 supports Python 3.7.1 and higher (GH35214).
Increased minimum versions for dependencies
Some minimum supported versions of dependencies were updated (GH35214). If installed, we now require:
Package | Minimum Version | Required | Changed |
---|---|---|---|
numpy | 1.16.5 | X | X |
pytz | 2017.3 | X | X |
python-dateutil | 2.7.3 | X | |
bottleneck | 1.2.1 | | |
numexpr | 2.6.8 | X | |
pytest (dev) | 5.0.1 | X | |
mypy (dev) | 0.782 | X | |
For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.
Package | Minimum Version | Changed |
---|---|---|
beautifulsoup4 | 4.6.0 | |
fastparquet | 0.3.2 | |
fsspec | 0.7.4 | |
gcsfs | 0.6.0 | |
lxml | 4.3.0 | X |
matplotlib | 2.2.3 | X |
numba | 0.46.0 | |
openpyxl | 2.6.0 | X |
pyarrow | 0.15.0 | X |
pymysql | 0.7.11 | X |
pytables | 3.5.1 | X |
s3fs | 0.4.0 | |
scipy | 1.2.0 | |
sqlalchemy | 1.2.8 | X |
xarray | 0.12.3 | X |
xlrd | 1.2.0 | X |
xlsxwriter | 1.0.2 | X |
xlwt | 1.3.0 | X |
pandas-gbq | 0.12.0 | |
See Dependencies and Optional dependencies for more.
Other API changes
- Sorting in descending order is now stable for Series.sort_values() and Index.sort_values() for Datetime-like Index subclasses. This will affect sort order when sorting a DataFrame on multiple columns, sorting with a key function that produces duplicates, or requesting the sorting index when using Index.sort_values(). When using Series.value_counts(), the count of missing values is no longer necessarily last in the list of duplicate counts. Instead, its position corresponds to the position in the original Series. When using Index.sort_values() for Datetime-like Index subclasses, NaTs ignored the na_position argument and were sorted to the beginning. Now they respect na_position, the default being last, same as other Index subclasses (GH35992)
- Passing an invalid fill_value to Categorical.take(), DatetimeArray.take(), TimedeltaArray.take(), or PeriodArray.take() now raises a TypeError instead of a ValueError (GH37733)
- Passing an invalid fill_value to Series.shift() with a CategoricalDtype now raises a TypeError instead of a ValueError (GH37733)
- Passing an invalid value to IntervalIndex.insert() or CategoricalIndex.insert() now raises a TypeError instead of a ValueError (GH37733)
- Attempting to reindex a Series with a CategoricalIndex with an invalid fill_value now raises a TypeError instead of a ValueError (GH37733)
- CategoricalIndex.append() with an index that contains non-category values will now cast instead of raising TypeError (GH38098)
Deprecations
- Deprecated parameter inplace in MultiIndex.set_codes() and MultiIndex.set_levels() (GH35626)
- Deprecated parameter dtype of method copy() for all Index subclasses. Use the astype() method instead for changing dtype (GH35853)
- Deprecated parameters levels and codes in MultiIndex.copy(). Use the set_levels() and set_codes() methods instead (GH36685)
- Date parser functions parse_date_time(), parse_date_fields(), parse_all_fields() and generic_parser() from pandas.io.date_converters are deprecated and will be removed in a future version; use to_datetime() instead (GH35741)
- DataFrame.lookup() is deprecated and will be removed in a future version, use DataFrame.melt() and DataFrame.loc() instead (GH35224)
- The method Index.to_native_types() is deprecated. Use .astype(str) instead (GH28867)
- Deprecated indexing DataFrame rows with a single datetime-like string as df[string] (given the ambiguity whether it is indexing the rows or selecting a column), use df.loc[string] instead (GH36179)
- Deprecated Index.is_all_dates() (GH27744)
- The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804)
- Deprecated automatic alignment on comparison operations between DataFrame and Series, do frame, ser = frame.align(ser, axis=1, copy=False) before e.g. frame == ser (GH28759)
- Rolling.count() with min_periods=None will default to the size of the window in a future version (GH31302)
- Using “outer” ufuncs on DataFrames to return 4d ndarray is now deprecated. Convert to an ndarray first (GH23743)
- Deprecated slice-indexing on tz-aware DatetimeIndex with naive datetime objects, to match scalar indexing behavior (GH36148)
- Index.ravel() returning a np.ndarray is deprecated, in the future this will return a view on the same index (GH19956)
- Deprecate use of strings denoting units with ‘M’, ‘Y’ or ‘y’ in to_timedelta() (GH36666)
- Index methods &, |, and ^ behaving as the set operations Index.intersection(), Index.union(), and Index.symmetric_difference(), respectively, are deprecated and in the future will behave as pointwise boolean operations matching Series behavior. Use the named set methods instead (GH36758)
- Categorical.is_dtype_equal() and CategoricalIndex.is_dtype_equal() are deprecated, will be removed in a future version (GH37545)
- Series.slice_shift() and DataFrame.slice_shift() are deprecated, use Series.shift() or DataFrame.shift() instead (GH37601)
- Partial slicing on unordered DatetimeIndex objects with keys that are not in the index is deprecated and will be removed in a future version (GH18531)
- The how keyword in PeriodIndex.astype() is deprecated and will be removed in a future version, use index.to_timestamp(how=how) instead (GH37982)
- Deprecated Index.asi8() for Index subclasses other than DatetimeIndex, TimedeltaIndex, and PeriodIndex (GH37877)
- The inplace parameter of Categorical.remove_unused_categories() is deprecated and will be removed in a future version (GH37643)
- The null_counts parameter of DataFrame.info() is deprecated and replaced by show_counts. It will be removed in a future version (GH37999)
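Two of the deprecations above have replacements that behave the same before and after the change, so code can be migrated ahead of time. The sketch below passes regex explicitly to Series.str.replace() and uses the named Index set methods instead of the &, | and ^ operators. The sample data is invented for illustration.

```python
import pandas as pd

prices = pd.Series(["1.50", "2.75"])

# Passing regex explicitly avoids depending on the changing default.
literal = prices.str.replace(".", ",", regex=False)
masked = prices.str.replace(r"\d", "#", regex=True)

left = pd.Index([1, 2, 3])
right = pd.Index([2, 3, 4])

# Named set methods instead of the &, | and ^ operators.
common = left.intersection(right)
combined = left.union(right)
exclusive = left.symmetric_difference(right)
```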
Calling NumPy ufuncs on non-aligned DataFrames
Calling NumPy ufuncs on non-aligned DataFrames changed behaviour in pandas 1.2.0 (to align the inputs before calling the ufunc), but this change is reverted in pandas 1.2.1. The behaviour to not align is now deprecated instead, see the 1.2.1 release notes for more details.
Performance improvements
- Performance improvements when creating DataFrame or Series with dtype str or StringDtype from array with many string elements (GH36304, GH36317, GH36325, GH36432, GH37371)
- Performance improvement in GroupBy.agg() with the numba engine (GH35759)
- Performance improvements when creating Series.map() from a huge dictionary (GH34717)
- Performance improvement in GroupBy.transform() with the numba engine (GH36240)
- Styler uuid method altered to compress data transmission over web whilst maintaining reasonably low table collision probability (GH36345)
- Performance improvement in to_datetime() with non-ns time unit for float dtype columns (GH20445)
- Performance improvement in setting values on an IntervalArray (GH36310)
- The internal index method _shallow_copy() now makes the new index and original index share cached attributes, avoiding creating these again, if created on either. This can speed up operations that depend on creating copies of existing indexes (GH36840)
- Performance improvement in RollingGroupby.count() (GH35625)
- Small performance decrease to Rolling.min() and Rolling.max() for fixed windows (GH36567)
- Reduced peak memory usage in DataFrame.to_pickle() when using protocol=5 in python 3.8+ (GH34244)
- Faster dir calls when the object has many index labels, e.g. dir(ser) (GH37450)
- Performance improvement in ExpandingGroupby (GH37064)
- Performance improvement in Series.astype() and DataFrame.astype() for Categorical (GH8628)
- Performance improvement in DataFrame.groupby() for float dtype (GH28303), changes of the underlying hash-function can lead to changes in float based indexes sort ordering for ties (e.g. Index.value_counts())
- Performance improvement in pd.isin() for inputs with more than 1e6 elements (GH36611)
- Performance improvement for DataFrame.__setitem__() with list-like indexers (GH37954)
- read_json() now avoids reading entire file into memory when chunksize is specified (GH34548)
Bug fixes
Categorical
- Categorical.fillna() will always return a copy, validate a passed fill value regardless of whether there are any NAs to fill, and disallow an NaT as a fill value for numeric categories (GH36530)
- Bug in Categorical.__setitem__() that incorrectly raised when trying to set a tuple value (GH20439)
- Bug in CategoricalIndex.equals() incorrectly casting non-category entries to np.nan (GH37667)
- Bug in CategoricalIndex.where() incorrectly setting non-category entries to np.nan instead of raising TypeError (GH37977)
- Bug in Categorical.to_numpy() and np.array(categorical) with tz-aware datetime64 categories incorrectly dropping the time zone information instead of casting to object dtype (GH38136)
Datetime-like
- Bug in DataFrame.combine_first() that would convert datetime-like column on other DataFrame to integer when the column is not present in original DataFrame (GH28481)
- Bug in DatetimeArray.date where a ValueError would be raised with a read-only backing array (GH33530)
- Bug in NaT comparisons failing to raise TypeError on invalid inequality comparisons (GH35046)
- Bug in DateOffset where attributes reconstructed from pickle files differ from original objects when input values exceed normal ranges (e.g. months=12) (GH34511)
- Bug in DatetimeIndex.get_slice_bound() where datetime.date objects were not accepted or naive Timestamp with a tz-aware DatetimeIndex (GH35690)
- Bug in DatetimeIndex.slice_locs() where datetime.date objects were not accepted (GH34077)
- Bug in DatetimeIndex.searchsorted(), TimedeltaIndex.searchsorted(), PeriodIndex.searchsorted(), and Series.searchsorted() with datetime64, timedelta64 or Period dtype placement of NaT values being inconsistent with NumPy (GH36176, GH36254)
- Inconsistency in DatetimeArray, TimedeltaArray, and PeriodArray method __setitem__ casting arrays of strings to datetime-like scalars but not scalar strings (GH36261)
- Bug in DatetimeArray.take() incorrectly allowing fill_value with a mismatched time zone (GH37356)
- Bug in DatetimeIndex.shift incorrectly raising when shifting empty indexes (GH14811)
- Timestamp and DatetimeIndex comparisons between tz-aware and tz-naive objects now follow the standard library datetime behavior, returning True/False for !=/== and raising for inequality comparisons (GH28507)
- Bug in DatetimeIndex.equals() and TimedeltaIndex.equals() incorrectly considering int64 indexes as equal (GH36744)
- Series.to_json(), DataFrame.to_json(), and read_json() now implement time zone parsing when orient structure is table (GH35973)
- astype() now attempts to convert to datetime64[ns, tz] directly from object with inferred time zone from string (GH35973)
- Bug in TimedeltaIndex.sum() and Series.sum() with timedelta64 dtype on an empty index or series returning NaT instead of Timedelta(0) (GH31751)
- Bug in DatetimeArray.shift() incorrectly allowing fill_value with a mismatched time zone (GH37299)
- Bug in adding a BusinessDay with nonzero offset to a non-scalar other (GH37457)
- Bug in to_datetime() with a read-only array incorrectly raising (GH34857)
- Bug in Series.isin() with datetime64[ns] dtype and DatetimeIndex.isin() incorrectly casting integers to datetimes (GH36621)
- Bug in Series.isin() with datetime64[ns] dtype and DatetimeIndex.isin() failing to consider tz-aware and tz-naive datetimes as always different (GH35728)
- Bug in Series.isin() with PeriodDtype dtype and PeriodIndex.isin() failing to consider arguments with different PeriodDtype as always different (GH37528)
- Bug in Period constructor now correctly handles nanoseconds in the value argument (GH34621 and GH17053)
Timedelta
- Bug in TimedeltaIndex, Series, and DataFrame floor-division with timedelta64 dtypes and NaT in the denominator (GH35529)
- Bug in parsing of ISO 8601 durations in Timedelta and to_datetime() (GH29773, GH36204)
- Bug in to_timedelta() with a read-only array incorrectly raising (GH34857)
- Bug in Timedelta incorrectly truncating to sub-second portion of a string input when it has precision higher than nanoseconds (GH36738)
Timezones
- Bug in date_range() was raising AmbiguousTimeError for valid input with ambiguous=False (GH35297)
- Bug in Timestamp.replace() was losing fold information (GH37610)
Numeric
- Bug in to_numeric() where float precision was incorrect (GH31364)
- Bug in DataFrame.any() with axis=1 and bool_only=True ignoring the bool_only keyword (GH32432)
- Bug in Series.equals() where a ValueError was raised when NumPy arrays were compared to scalars (GH35267)
- Bug in Series where two Series each have a DatetimeIndex with different time zones having those indexes incorrectly changed when performing arithmetic operations (GH33671)
- Bug in pandas.testing module functions when used with check_exact=False on complex numeric types (GH28235)
- Bug in DataFrame.__rmatmul__() error handling reporting transposed shapes (GH21581)
- Bug in Series flex arithmetic methods where the result when operating with a list, tuple or np.ndarray would have an incorrect name (GH36760)
- Bug in IntegerArray multiplication with timedelta and np.timedelta64 objects (GH36870)
- Bug in MultiIndex comparison with tuple incorrectly treating tuple as array-like (GH21517)
- Bug in DataFrame.diff() with datetime64 dtypes including NaT values failing to fill NaT results correctly (GH32441)
- Bug in DataFrame arithmetic ops incorrectly accepting keyword arguments (GH36843)
- Bug in IntervalArray comparisons with Series not returning Series (GH36908)
- Bug in DataFrame allowing arithmetic operations with list of array-likes with undefined results. Behavior changed to raising ValueError (GH36702)
- Bug in DataFrame.std() with timedelta64 dtype and skipna=False (GH37392)
- Bug in DataFrame.min() and DataFrame.max() with datetime64 dtype and skipna=False (GH36907)
- Bug in DataFrame.idxmax() and DataFrame.idxmin() with mixed dtypes incorrectly raising TypeError (GH38195)
Conversion
- Bug in DataFrame.to_dict() with orient='records' now returns python native datetime objects for datetime-like columns (GH21256)
- Bug in Series.astype() conversion from string to float raised in presence of pd.NA values (GH37626)
Strings
- Bug in Series.to_string(), DataFrame.to_string(), and DataFrame.to_latex() adding a leading space when index=False (GH24980)
- Bug in to_numeric() raising a TypeError when attempting to convert a string dtype Series containing only numeric strings and NA (GH37262)
Interval
- Bug in DataFrame.replace() and Series.replace() where Interval dtypes would be converted to object dtypes (GH34871)
- Bug in IntervalIndex.take() with negative indices and fill_value=None (GH37330)
- Bug in IntervalIndex.putmask() with datetime-like dtype incorrectly casting to object dtype (GH37968)
- Bug in IntervalArray.astype() incorrectly dropping dtype information with a CategoricalDtype object (GH37984)
Indexing¶
Bug in
PeriodIndex.get_loc()
incorrectly raisingValueError
on non-datelike strings instead ofKeyError
, causing similar errors inSeries.__getitem__()
,Series.__contains__()
, andSeries.loc.__getitem__()
(GH34240)Bug in
Index.sort_values()
where, when empty values were passed, the method would break by trying to compare missing values instead of pushing them to the end of the sort order (GH35584)Bug in
Index.get_indexer()
andIndex.get_indexer_non_unique()
whereint64
arrays are returned instead ofintp
(GH36359)Bug in
DataFrame.sort_index()
where parameter ascending passed as a list on a single level index gives wrong result (GH32334)Bug in
DataFrame.reset_index()
was incorrectly raising aValueError
for input with aMultiIndex
with missing values in a level withCategorical
dtype (GH24206)Bug in indexing with boolean masks on datetime-like values sometimes returning a view instead of a copy (GH36210)
Bug in
DataFrame.__getitem__()
andDataFrame.loc.__getitem__()
withIntervalIndex
columns and a numeric indexer (GH26490)Bug in
Series.loc.__getitem__()
with a non-uniqueMultiIndex
and an empty-list indexer (GH13691)Bug in indexing on a
Series
orDataFrame
with aMultiIndex
and a level named"0"
(GH37194)Bug in
Series.__getitem__()
when using an unsigned integer array as an indexer giving incorrect results or segfaulting instead of raisingKeyError
(GH37218)Bug in
Index.where()
incorrectly casting numeric values to strings (GH37591)Bug in
DataFrame.loc()
returning empty result when indexer is a slice with negative step size (GH38071)Bug in
Series.loc()
andDataFrame.loc()
raises when the index was ofobject
dtype and the given numeric label was in the index (GH26491)Bug in
DataFrame.loc()
returned requested key plus missing values whenloc
was applied to single level from a MultiIndex (GH27104)
Bug in indexing on a Series or DataFrame with a CategoricalIndex using a list-like indexer containing NA values (GH37722)
Bug in DataFrame.loc.__setitem__() expanding an empty DataFrame with mixed dtypes (GH37932)
Bug in DataFrame.xs() ignoring droplevel=False for columns (GH19056)
Bug in DataFrame.reindex() wrongly raising IndexingError for an empty DataFrame with tolerance not None or method="nearest" (GH27315)
Bug in indexing on a Series or DataFrame with a CategoricalIndex using a list-like indexer that contains elements that are in the index’s categories but not in the index itself failing to raise KeyError (GH37901)
Bug on inserting a boolean label into a DataFrame with a numeric Index columns incorrectly casting to integer (GH36319)
Bug in DataFrame.iloc() and Series.iloc() aligning objects in __setitem__ (GH22046)
Bug in MultiIndex.drop() not raising if labels are only partially found (GH37820)
Bug in DataFrame.loc() not raising KeyError when a missing combination was given with slice(None) for the remaining levels (GH19556)
Bug in DataFrame.loc() raising TypeError when a non-integer slice was given to select values from a MultiIndex (GH25165, GH24263)
Bug in Series.at() returning a Series with one element instead of a scalar when the index is a MultiIndex with one level (GH38053)
Bug in DataFrame.loc() returning and assigning elements in the wrong order when the indexer is differently ordered than the MultiIndex to filter (GH31330, GH34603)
Bug in DataFrame.loc() and DataFrame.__getitem__() raising KeyError when columns were a MultiIndex with only one level (GH29749)
Bug in Series.__getitem__() and DataFrame.__getitem__() raising a blank KeyError without missing keys for IntervalIndex (GH27365)
Bug in setting a new label on a DataFrame or Series with a CategoricalIndex incorrectly raising TypeError when the new label is not among the index’s categories (GH38098)
Bug in Series.loc() and Series.iloc() raising ValueError when inserting a list-like np.array, list or tuple in an object Series of equal length (GH37748, GH37486)
Bug in Series.loc() and Series.iloc() setting all the values of an object Series with those of a list-like ExtensionArray instead of inserting it (GH38271)
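The CategoricalIndex entry above (GH37901) can be illustrated with a minimal sketch. The series, its labels, and the helper lookup_raises_key_error are made up for illustration; the sketch assumes a pandas version that includes the fix.

```python
import pandas as pd

# The index holds only "a" and "b"; "c" is a declared category that never
# appears as a label.
idx = pd.CategoricalIndex(["a", "b"], categories=["a", "b", "c"])
s = pd.Series([1, 2], index=idx)


def lookup_raises_key_error(series, labels):
    """Hypothetical helper: True if series.loc[labels] raises KeyError."""
    try:
        series.loc[labels]
    except KeyError:
        return True
    return False


# Labels that are all present in the index can be selected.
present = lookup_raises_key_error(s, ["a", "b"])
# A label that is only an unused category is treated as missing.
unused_category = lookup_raises_key_error(s, ["a", "c"])
```

With the fix in place, a list-like indexer is checked against the labels actually present in the index, not against the declared categories.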
Missing¶
SeriesGroupBy.transform() now correctly handles missing values for dropna=False (GH35014)
Bug in Series.nunique() with dropna=True was returning incorrect results when both NA and None missing values were present (GH37566)
Bug in Series.interpolate() where the keyword arguments limit_area and limit_direction had no effect when using methods pad and backfill (GH31048)
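As a small sketch of the Series.nunique() entry (GH37566), the example below mixes both kinds of missing value in one object Series. The data is invented for illustration, and a pandas version containing the fix is assumed.

```python
import pandas as pd

# An object-dtype Series holding two different missing-value markers
# (pd.NA and None) next to one repeated real value.
s = pd.Series([pd.NA, None, "x", "x"], dtype=object)

# With dropna=True, both markers are dropped before counting.
n_unique = s.nunique(dropna=True)
```

Only "x" remains once the missing values are dropped, so the count covers a single distinct value.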
MultiIndex¶
Bug in DataFrame.xs() when used with IndexSlice raising TypeError with message "Expected label or tuple of labels" (GH35301)
Bug in DataFrame.reset_index() with NaT values in index raising ValueError with message "cannot convert float NaN to integer" (GH36541)
Bug in DataFrame.combine_first() when used with MultiIndex containing string and NaN values raising TypeError (GH36562)
Bug in MultiIndex.drop() dropping NaN values when a non-existing key was given as input (GH18853)
Bug in MultiIndex.drop() dropping more values than expected when index has duplicates and is not sorted (GH33494)
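The DataFrame.reset_index() entry (GH36541) can be sketched as follows. The level names "t" and "k" and the values are assumptions chosen for illustration, and the exact input that triggered the original report may have differed.

```python
import pandas as pd

# A two-level index whose datetime level contains a missing value (NaT).
idx = pd.MultiIndex.from_arrays(
    [[pd.Timestamp("2020-01-01"), pd.NaT], ["a", "b"]],
    names=["t", "k"],
)
df = pd.DataFrame({"v": [1, 2]}, index=idx)

# Moving the index levels back into columns should keep NaT as a missing
# value instead of failing on the integer conversion.
out = df.reset_index()
missing_mask = out["t"].isna().tolist()
```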
I/O¶
read_sas() no longer leaks resources on failure (GH35566)
Bug in DataFrame.to_csv() and Series.to_csv() caused a ValueError when it was called with a filename in combination with mode containing a b (GH35058)
Bug in read_csv() with float_precision='round_trip' did not handle decimal and thousands parameters (GH35365)
to_pickle() and read_pickle() were closing user-provided file objects (GH35679)
to_csv() passes compression arguments for 'gzip' always to gzip.GzipFile (GH28103)
to_csv() did not support zip compression for binary file object not having a filename (GH35058)
to_csv() and read_csv() did not honor compression and encoding for path-like objects that are internally converted to file-like objects (GH35677, GH26124, GH32392)
DataFrame.to_pickle(), Series.to_pickle(), and read_pickle() did not support compression for file-objects (GH26237, GH29054, GH29570)
Bug in LongTableBuilder.middle_separator() was duplicating LaTeX longtable entries in the List of Tables of a LaTeX document (GH34360)
Bug in read_csv() with engine='python' truncating data if multiple items present in first row and first element started with BOM (GH36343)
Removed private_key and verbose from read_gbq() as they are no longer supported in pandas-gbq (GH34654, GH30200)
Bumped minimum pytables version to 3.5.1 to avoid a ValueError in read_hdf() (GH24839)
Bug in read_table() and read_csv() when delim_whitespace=True and sep=default (GH36583)
Bug in DataFrame.to_json() and Series.to_json() when used with lines=True and orient='records' the last line of the record is not appended with ‘new line character’ (GH36888)
Bug in read_parquet() with fixed offset time zones. String representation of time zones was not recognized (GH35997, GH36004)
Bug in DataFrame.to_html(), DataFrame.to_string(), and DataFrame.to_latex() ignoring the na_rep argument when float_format was also specified (GH9046, GH13828)
Bug in output rendering of complex numbers showing too many trailing zeros (GH36799)
Bug in HDFStore threw a TypeError when exporting an empty DataFrame with datetime64[ns, tz] dtypes with a fixed HDF5 store (GH20594)
Bug in HDFStore was dropping time zone information when exporting a Series with datetime64[ns, tz] dtypes with a fixed HDF5 store (GH20594)
read_csv() was closing user-provided binary file handles when engine="c" and an encoding was requested (GH36980)
Bug in DataFrame.to_hdf() was not dropping missing rows with dropna=True (GH35719)
Bug in read_html() was raising a TypeError when supplying a pathlib.Path argument to the io parameter (GH37705)
DataFrame.to_excel(), Series.to_excel(), DataFrame.to_markdown(), and Series.to_markdown() now support writing to fsspec URLs such as S3 and Google Cloud Storage (GH33987)
Bug in read_fwf() with skip_blank_lines=True was not skipping blank lines (GH37758)
Parse missing values using read_json() with dtype=False to NaN instead of None (GH28501)
read_fwf() was inferring compression with compression=None which was not consistent with the other read_* functions (GH37909)
DataFrame.to_html() was ignoring formatters argument for ExtensionDtype columns (GH36525)
Bumped minimum xarray version to 0.12.3 to avoid reference to the removed Panel class (GH27101, GH37983)
DataFrame.to_csv() was re-opening file-like handles that also implement os.PathLike (GH38125)
Bug in the conversion of a sliced pyarrow.Table with missing values to a DataFrame (GH38525)
Bug in read_sql_table() raising a sqlalchemy.exc.OperationalError when column names contained a percentage sign (GH37517)
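Several I/O entries above concern user-provided file objects. A minimal sketch of the to_pickle() / read_pickle() behaviour described in GH35679, using an in-memory buffer as a stand-in for any binary file object (the frame contents are invented for illustration):

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# The caller owns this buffer, so pandas should leave it open.
buf = io.BytesIO()
df.to_pickle(buf)
open_after_write = not buf.closed

# Rewind and read the frame back from the same handle.
buf.seek(0)
restored = pd.read_pickle(buf)
open_after_read = not buf.closed
```

Because the handle stays open, the caller can keep writing to it or rewind it, and remains responsible for closing it.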
Period¶
Bug in DataFrame.replace() and Series.replace() where Period dtypes would be converted to object dtypes (GH34871)
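A short sketch of the Period entry above (GH34871). The monthly periods and the replacement value are made up for illustration, and a pandas version with the fix is assumed.

```python
import pandas as pd

# A Series with period[M] dtype covering three consecutive months.
s = pd.Series(pd.period_range("2020-01", periods=3, freq="M"))

# Replace one period with another period of the same frequency.
out = s.replace(pd.Period("2020-01", freq="M"), pd.Period("2021-01", freq="M"))

# The result is expected to keep the period dtype instead of falling
# back to object.
result_dtype = str(out.dtype)
```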
Plotting¶
Bug in DataFrame.plot() was rotating xticklabels when subplots=True, even if the x-axis wasn’t an irregular time series (GH29460)
Bug in DataFrame.plot() where a marker letter in the style keyword sometimes caused a ValueError (GH21003)
Bug in DataFrame.plot.bar() and Series.plot.bar() where ticks positions were assigned by value order instead of using the actual value for numeric or a smart ordering for string (GH26186, GH11465). This fix has been reverted in pandas 1.2.1, see What’s new in 1.2.1 (January 20, 2021)
Twinned axes were losing their tick labels which should only happen to all but the last row or column of ‘externally’ shared axes (GH33819)
Bug in Series.plot() and DataFrame.plot() was throwing a ValueError when the Series or DataFrame was indexed by a TimedeltaIndex with a fixed frequency and the x-axis lower limit was greater than the upper limit (GH37454)
Bug in DataFrameGroupBy.boxplot() when subplots=False would raise a KeyError (GH16748)
Bug in DataFrame.plot() and Series.plot() was overwriting matplotlib’s shared y axes behavior when no sharey parameter was passed (GH37942)
Bug in DataFrame.plot() was raising a TypeError with ExtensionDtype columns (GH32073)
Styler¶
Bug in Styler.render() where HTML was generated incorrectly because of a formatting error in the rowspan attribute; it now matches w3 syntax (GH38234)
Groupby/resample/rolling¶
Bug in DataFrameGroupBy.count() and SeriesGroupBy.sum() returning NaN for missing categories when grouped on multiple Categoricals. Now returning 0 (GH35028)
Bug in DataFrameGroupBy.apply() that would sometimes throw an erroneous ValueError if the grouping axis had duplicate entries (GH16646)
Bug in DataFrame.resample() that would throw a ValueError when resampling from "D" to "24H" over a transition into daylight savings time (DST) (GH35219)
Bug when combining methods DataFrame.groupby() with DataFrame.resample() and DataFrame.interpolate() raising a TypeError (GH35325)
Bug in DataFrameGroupBy.apply() where a non-nuisance grouping column would be dropped from the output columns if another groupby method was called before .apply (GH34656)
Bug when subsetting columns on a DataFrameGroupBy (e.g. df.groupby('a')[['b']]) would reset the attributes axis, dropna, group_keys, level, mutated, sort, and squeeze to their default values (GH9959)
Bug in DataFrameGroupBy.tshift() failing to raise ValueError when a frequency cannot be inferred for the index of a group (GH35937)
Bug in DataFrame.groupby() not always maintaining the column index name for any, all, bfill, ffill, shift (GH29764)
Bug in DataFrameGroupBy.apply() raising error with np.nan group(s) when dropna=False (GH35889)
Bug in Rolling.sum() returning wrong values when dtypes were mixed between float and integer and axis=1 (GH20649, GH35596)
Bug in Rolling.count() returning np.nan with FixedForwardWindowIndexer as window, min_periods=0 and only missing values in the window (GH35579)
Bug where pandas.core.window.Rolling produces incorrect window sizes when using a PeriodIndex (GH34225)
Bug in DataFrameGroupBy.ffill() and DataFrameGroupBy.bfill() where a NaN group would return filled values instead of NaN when dropna=True (GH34725)
Bug in RollingGroupby.count() where a ValueError was raised when specifying the closed parameter (GH35869)
Bug in DataFrameGroupBy.rolling() returning wrong values with partial centered window (GH36040)
Bug in DataFrameGroupBy.rolling() returning wrong values with a time aware window containing NaN. Raises ValueError because windows are not monotonic now (GH34617)
Bug in Rolling.__iter__() where a ValueError was not raised when min_periods was larger than window (GH37156)
Using Rolling.var() instead of Rolling.std() avoids numerical issues for Rolling.corr() when Rolling.var() is still within floating point precision while Rolling.std() is not (GH31286)
Bug in DataFrameGroupBy.quantile() and Resampler.quantile() raising TypeError when values were of type Timedelta (GH29485)
Bug in Rolling.median() and Rolling.quantile() returning wrong values for BaseIndexer subclasses with non-monotonic starting or ending points for windows (GH37153)
Bug in DataFrame.groupby() dropping nan groups from result with dropna=False when grouping over a single column (GH35646, GH35542)
Bug in DataFrameGroupBy.head(), DataFrameGroupBy.tail(), SeriesGroupBy.head(), and SeriesGroupBy.tail() would raise when used with axis=1 (GH9772)
Bug in DataFrameGroupBy.transform() would raise when used with axis=1 and a transformation kernel (e.g. “shift”) (GH36308)
Bug in DataFrameGroupBy.resample() where using .agg with sum produced a different result than just calling .sum (GH33548)
Bug in DataFrameGroupBy.apply() dropping values on nan group when returning the same axes with the original frame (GH38227)
Bug in DataFrameGroupBy.quantile() not handling array-like q when grouping by columns (GH33795)
Bug in DataFrameGroupBy.rank() with datetime64tz or period dtype incorrectly casting results to those dtypes instead of returning float64 dtype (GH38187)
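The dropna=False entry for grouping over a single column (GH35646, GH35542) can be sketched like this. The column names key and val and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", np.nan, "a", np.nan], "val": [1, 2, 3, 4]}
)

# Default behaviour: rows whose key is NaN are excluded from the result.
dropped = df.groupby("key")["val"].sum()

# With dropna=False the NaN key forms its own group.
kept = df.groupby("key", dropna=False)["val"].sum()
nan_group_total = kept[kept.index.isna()].iloc[0]
```

Comparing the two results makes the effect of the keyword visible: the second result has one extra row, labelled NaN, holding the total of the rows without a key.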
Reshaping¶
Bug in DataFrame.crosstab() was returning incorrect results on inputs with duplicate row names, duplicate column names or duplicate names between row and column labels (GH22529)
Bug in DataFrame.pivot_table() with aggfunc='count' or aggfunc='sum' returning NaN for missing categories when pivoted on a Categorical. Now returning 0 (GH31422)
Bug in concat() and DataFrame constructor where input index names are not preserved in some cases (GH13475)
Bug in crosstab() when using multiple columns with margins=True and normalize=True (GH35144)
Bug in DataFrame.stack() where an empty DataFrame.stack would raise an error (GH36113). Now returning an empty Series with empty MultiIndex.
Bug in Series.unstack(). Now a Series with single level of Index trying to unstack would raise a ValueError (GH36113)
Bug in DataFrame.agg() with func={'name':<FUNC>} incorrectly raising TypeError when DataFrame.columns==['Name'] (GH36212)
Bug in Series.transform() would give incorrect results or raise when the argument func was a dictionary (GH35811)
Bug in DataFrame.pivot() not preserving MultiIndex level names for columns when rows and columns are both multiindexed (GH36360)
Bug in DataFrame.pivot() modifying the index argument when columns was passed but values was not (GH37635)
Bug in DataFrame.join() returning a non-deterministic level order for the resulting MultiIndex (GH36910)
Bug in DataFrame.combine_first() causing wrong alignment with dtype string and one level of MultiIndex containing only NA (GH37591)
Fixed regression in merge() on merging DatetimeIndex with empty DataFrame (GH36895)
Bug in DataFrame.apply() not setting index of return value when func return type is dict (GH37544)
Bug in DataFrame.merge() and pandas.merge() returning inconsistent ordering in result for how=right and how=left (GH35382)
Bug in merge_ordered() not handling list-like left_by or right_by (GH35269)
Bug in merge_ordered() returning a wrong join result when the length of left_by or right_by equals the number of rows of left or right (GH38166)
Bug in merge_ordered() not raising when elements in left_by or right_by do not exist in left columns or right columns (GH38167)
Bug in DataFrame.drop_duplicates() not validating bool dtype for ignore_index keyword (GH38274)
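Two of the entries above describe inputs that now raise ValueError: Series.unstack() on a single-level index (GH36113) and a non-boolean ignore_index in DataFrame.drop_duplicates() (GH38274). The sketch below checks both; raises_value_error is a hypothetical helper and the data is invented.

```python
import pandas as pd


def raises_value_error(fn):
    """Hypothetical helper: True if calling fn() raises ValueError."""
    try:
        fn()
    except ValueError:
        return True
    return False


# A Series with a flat (single-level) index cannot be unstacked.
flat = pd.Series([1, 2, 3], index=["x", "y", "z"])
unstack_rejected = raises_value_error(lambda: flat.unstack())

# ignore_index must be a real bool; a string is rejected.
df = pd.DataFrame({"a": [1, 1, 2]})
bad_flag_rejected = raises_value_error(
    lambda: df.drop_duplicates(ignore_index="yes")
)

# A proper bool works and renumbers the remaining rows from zero.
deduped = df.drop_duplicates(ignore_index=True)
```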
ExtensionArray¶
Fixed bug where DataFrame column set to scalar extension type via a dict instantiation was considered an object type rather than the extension type (GH35965)
Fixed bug where astype() with equal dtype and copy=False would return a new object (GH28488)
Fixed bug when applying a NumPy ufunc with multiple outputs to an IntegerArray returning None (GH36913)
Fixed an inconsistency in PeriodArray’s __init__ signature to those of DatetimeArray and TimedeltaArray (GH37289)
Reductions for BooleanArray, Categorical, DatetimeArray, FloatingArray, IntegerArray, PeriodArray, TimedeltaArray, and PandasArray are now keyword-only methods (GH37541)
Fixed a bug where a TypeError was wrongly raised if a membership check was made on an ExtensionArray containing nan-like values (GH37867)
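The astype() entry (GH28488) can be checked with an identity comparison. This sketch uses a nullable Int64 array as one example of an extension array; other array types may behave differently.

```python
import pandas as pd

arr = pd.array([1, 2, None], dtype="Int64")

# Same dtype and copy=False: the original object is handed back.
same = arr.astype("Int64", copy=False)
returns_same_object = same is arr

# Same dtype and copy=True: a distinct array is returned.
copied = arr.astype("Int64", copy=True)
returns_new_object = copied is not arr
```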
Other¶
Bug in DataFrame.replace() and Series.replace() incorrectly raising an AssertionError instead of a ValueError when invalid parameter combinations are passed (GH36045)
Bug in DataFrame.replace() and Series.replace() with numeric values and string to_replace (GH34789)
Fixed metadata propagation in Series.abs() and ufuncs called on Series and DataFrames (GH28283)
Bug in DataFrame.replace() and Series.replace() incorrectly casting from PeriodDtype to object dtype (GH34871)
Fixed bug in metadata propagation incorrectly copying DataFrame columns as metadata when the column name overlaps with the metadata name (GH37037)
Fixed metadata propagation in the Series.dt, Series.str accessors, DataFrame.duplicated, DataFrame.stack, DataFrame.unstack, DataFrame.pivot, DataFrame.append, DataFrame.diff, DataFrame.applymap and DataFrame.update methods (GH28283, GH37381)
Fixed metadata propagation when selecting columns with DataFrame.__getitem__ (GH28283)
Bug in Index.intersection() with non-Index failing to set the correct name on the returned Index (GH38111)
Bug in RangeIndex.intersection() failing to set the correct name on the returned Index in some corner cases (GH38197)
Bug in Index.difference() failing to set the correct name on the returned Index in some corner cases (GH38268)
Bug in Index.union() behaving differently depending on whether operand is an Index or other list-like (GH36384)
Bug in Index.intersection() with non-matching numeric dtypes casting to object dtype instead of minimal common dtype (GH38122)
Bug in IntervalIndex.union() returning an incorrectly-typed Index when empty (GH38282)
Passing an array with 2 or more dimensions to the Series constructor now raises the more specific ValueError rather than a bare Exception (GH35744)
Bug in dir where dir(obj) wouldn’t show attributes defined on the instance for pandas objects (GH37173)
Bug in Index.drop() raising InvalidIndexError when index has duplicates (GH38051)
Bug in RangeIndex.difference() returning Int64Index in some cases where it should return RangeIndex (GH38028)
Fixed bug in assert_series_equal() when comparing a datetime-like array with an equivalent non extension dtype array (GH37609)
Bug in is_bool_dtype() would raise when passed a valid string such as "boolean" (GH38386)
Fixed regression in logical operators raising ValueError when columns of DataFrame are a CategoricalIndex with unused categories (GH38367)
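Two entries in this list are easy to check directly: the Series constructor rejecting arrays with 2 or more dimensions (GH35744) and is_bool_dtype() accepting the string "boolean" (GH38386). A minimal sketch, with an arbitrary 2x2 array as input:

```python
import numpy as np
import pandas as pd

# A 2-D array is not valid Series data and is rejected with ValueError.
try:
    pd.Series(np.zeros((2, 2)))
    rejected_2d = False
except ValueError:
    rejected_2d = True

# The string alias of the nullable boolean dtype is recognised.
boolean_alias_ok = pd.api.types.is_bool_dtype("boolean")
```

Catching ValueError specifically, as done here, only became possible once the constructor stopped raising a bare Exception.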