Comparison with spreadsheets#

Since many potential pandas users have some familiarity with spreadsheet programs like Excel, this page is meant to provide some examples of how various spreadsheet operations would be performed using pandas. This page will use terminology and link to documentation for Excel, but much will be the same/similar in Google Sheets, LibreOffice Calc, Apple Numbers, and other Excel-compatible spreadsheet software.

If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the library.

As is customary, we import pandas and NumPy as follows:

In [1]: import pandas as pd

In [2]: import numpy as np

Data structures#

General terminology translation#

pandas

Excel

DataFrame

worksheet

Series

column

Index

row headings

row

row

NaN

empty cell

DataFrame#

A DataFrame in pandas is analogous to an Excel worksheet. While an Excel workbook can contain multiple worksheets, pandas DataFrames exist independently.

Series#

A Series is the data structure that represents one column of a DataFrame. Working with a Series is analogous to referencing a column of a spreadsheet.

Index#

Every DataFrame and Series has an Index, which are labels on the rows of the data. In pandas, if no index is specified, a RangeIndex is used by default (first row = 0, second row = 1, and so on), analogous to row headings/numbers in spreadsheets.

In pandas, indexes can be set to one (or multiple) unique values, which is like having a column that is used as the row identifier in a worksheet. Unlike most spreadsheets, these Index values can actually be used to reference the rows. (Note that this can be done in Excel with structured references.) For example, in spreadsheets, you would reference the first row as A1:Z1, while in pandas you could use populations.loc['Chicago'].

Index values are also persistent, so if you re-order the rows in a DataFrame, the label for a particular row don’t change.

See the indexing documentation for much more on how to use an Index effectively.

Copies vs. in place operations#

Most pandas operations return copies of the Series/DataFrame. To make the changes “stick”, you’ll need to either assign to a new variable:

sorted_df = df.sort_values("col1")

or overwrite the original one:

df = df.sort_values("col1")

Note

You will see an inplace=True or copy=False keyword argument available for some methods:

df.replace(5, inplace=True)

There is an active discussion about deprecating and removing inplace and copy for most methods (e.g. dropna) except for a very small subset of methods (including replace). Both keywords won’t be necessary anymore in the context of Copy-on-Write. The proposal can be found here.

Data input / output#

Constructing a DataFrame from values#

In a spreadsheet, values can be typed directly into cells.

A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.

In [3]: df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})

In [4]: df
Out[4]: 
   x  y
0  1  2
1  3  4
2  5  6

Reading external data#

Both Excel and pandas can import data from various sources in various formats.

CSV#

Let’s load and display the tips dataset from the pandas tests, which is a CSV file. In Excel, you would download and then open the CSV. In pandas, you pass the URL or local path of the CSV file to read_csv():

In [5]: url = (
   ...:     "https://raw.githubusercontent.com/pandas-dev"
   ...:     "/pandas/main/pandas/tests/io/data/csv/tips.csv"
   ...: )
   ...: 

In [6]: tips = pd.read_csv(url)
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
File /usr/lib/python3.13/urllib/request.py:1319, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1318 try:
-> 1319     h.request(req.get_method(), req.selector, req.data, headers,
   1320               encode_chunked=req.has_header('Transfer-encoding'))
   1321 except OSError as err: # timeout error

File /usr/lib/python3.13/http/client.py:1338, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
   1337 """Send a complete request to the server."""
-> 1338 self._send_request(method, url, body, headers, encode_chunked)

File /usr/lib/python3.13/http/client.py:1384, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
   1383     body = _encode(body, 'body')
-> 1384 self.endheaders(body, encode_chunked=encode_chunked)

File /usr/lib/python3.13/http/client.py:1333, in HTTPConnection.endheaders(self, message_body, encode_chunked)
   1332     raise CannotSendHeader()
-> 1333 self._send_output(message_body, encode_chunked=encode_chunked)

File /usr/lib/python3.13/http/client.py:1093, in HTTPConnection._send_output(self, message_body, encode_chunked)
   1092 del self._buffer[:]
-> 1093 self.send(msg)
   1095 if message_body is not None:
   1096 
   1097     # create a consistent interface to message_body

File /usr/lib/python3.13/http/client.py:1037, in HTTPConnection.send(self, data)
   1036 if self.auto_open:
-> 1037     self.connect()
   1038 else:

File /usr/lib/python3.13/http/client.py:1472, in HTTPSConnection.connect(self)
   1470 "Connect to a host on a given (SSL) port."
-> 1472 super().connect()
   1474 if self._tunnel_host:

File /usr/lib/python3.13/http/client.py:1003, in HTTPConnection.connect(self)
   1002 sys.audit("http.client.connect", self, self.host, self.port)
-> 1003 self.sock = self._create_connection(
   1004     (self.host,self.port), self.timeout, self.source_address)
   1005 # Might fail in OSs that don't implement TCP_NODELAY

File /usr/lib/python3.13/socket.py:864, in create_connection(address, timeout, source_address, all_errors)
    863 if not all_errors:
--> 864     raise exceptions[0]
    865 raise ExceptionGroup("create_connection failed", exceptions)

File /usr/lib/python3.13/socket.py:849, in create_connection(address, timeout, source_address, all_errors)
    848     sock.bind(source_address)
--> 849 sock.connect(sa)
    850 # Break explicitly a reference cycle

ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

URLError                                  Traceback (most recent call last)
Cell In[6], line 1
----> 1 tips = pd.read_csv(url)

File /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py:1026, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, date_format, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options, dtype_backend)
   1013 kwds_defaults = _refine_defaults_read(
   1014     dialect,
   1015     delimiter,
   (...)
   1022     dtype_backend=dtype_backend,
   1023 )
   1024 kwds.update(kwds_defaults)
-> 1026 return _read(filepath_or_buffer, kwds)

File /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py:620, in _read(filepath_or_buffer, kwds)
    617 _validate_names(kwds.get("names", None))
    619 # Create the parser.
--> 620 parser = TextFileReader(filepath_or_buffer, **kwds)
    622 if chunksize or iterator:
    623     return parser

File /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py:1620, in TextFileReader.__init__(self, f, engine, **kwds)
   1617     self.options["has_index_names"] = kwds["has_index_names"]
   1619 self.handles: IOHandles | None = None
-> 1620 self._engine = self._make_engine(f, self.engine)

File /usr/lib/python3/dist-packages/pandas/io/parsers/readers.py:1880, in TextFileReader._make_engine(self, f, engine)
   1878     if "b" not in mode:
   1879         mode += "b"
-> 1880 self.handles = get_handle(
   1881     f,
   1882     mode,
   1883     encoding=self.options.get("encoding", None),
   1884     compression=self.options.get("compression", None),
   1885     memory_map=self.options.get("memory_map", False),
   1886     is_text=is_text,
   1887     errors=self.options.get("encoding_errors", "strict"),
   1888     storage_options=self.options.get("storage_options", None),
   1889 )
   1890 assert self.handles is not None
   1891 f = self.handles.handle

File /usr/lib/python3/dist-packages/pandas/io/common.py:728, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    725     codecs.lookup_error(errors)
    727 # open URLs
--> 728 ioargs = _get_filepath_or_buffer(
    729     path_or_buf,
    730     encoding=encoding,
    731     compression=compression,
    732     mode=mode,
    733     storage_options=storage_options,
    734 )
    736 handle = ioargs.filepath_or_buffer
    737 handles: list[BaseBuffer]

File /usr/lib/python3/dist-packages/pandas/io/common.py:384, in _get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    382 # assuming storage_options is to be interpreted as headers
    383 req_info = urllib.request.Request(filepath_or_buffer, headers=storage_options)
--> 384 with urlopen(req_info) as req:
    385     content_encoding = req.headers.get("Content-Encoding", None)
    386     if content_encoding == "gzip":
    387         # Override compression based on Content-Encoding header

File /usr/lib/python3/dist-packages/pandas/io/common.py:289, in urlopen(*args, **kwargs)
    283 """
    284 Lazy-import wrapper for stdlib urlopen, as that imports a big chunk of
    285 the stdlib.
    286 """
    287 import urllib.request
--> 289 return urllib.request.urlopen(*args, **kwargs)

File /usr/lib/python3.13/urllib/request.py:189, in urlopen(url, data, timeout, context)
    187 else:
    188     opener = _opener
--> 189 return opener.open(url, data, timeout)

File /usr/lib/python3.13/urllib/request.py:489, in OpenerDirector.open(self, fullurl, data, timeout)
    486     req = meth(req)
    488 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 489 response = self._open(req, data)
    491 # post-process response
    492 meth_name = protocol+"_response"

File /usr/lib/python3.13/urllib/request.py:506, in OpenerDirector._open(self, req, data)
    503     return result
    505 protocol = req.type
--> 506 result = self._call_chain(self.handle_open, protocol, protocol +
    507                           '_open', req)
    508 if result:
    509     return result

File /usr/lib/python3.13/urllib/request.py:466, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
    464 for handler in handlers:
    465     func = getattr(handler, meth_name)
--> 466     result = func(*args)
    467     if result is not None:
    468         return result

File /usr/lib/python3.13/urllib/request.py:1367, in HTTPSHandler.https_open(self, req)
   1366 def https_open(self, req):
-> 1367     return self.do_open(http.client.HTTPSConnection, req,
   1368                         context=self._context)

File /usr/lib/python3.13/urllib/request.py:1322, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
   1319         h.request(req.get_method(), req.selector, req.data, headers,
   1320                   encode_chunked=req.has_header('Transfer-encoding'))
   1321     except OSError as err: # timeout error
-> 1322         raise URLError(err)
   1323     r = h.getresponse()
   1324 except:

URLError: <urlopen error [Errno 111] Connection refused>

In [7]: tips
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 tips

NameError: name 'tips' is not defined

Like Excel’s Text Import Wizard, read_csv can take a number of parameters to specify how the data should be parsed. For example, if the data was instead tab delimited, and did not have column names, the pandas command would be:

tips = pd.read_csv("tips.csv", sep="\t", header=None)

# alternatively, read_table is an alias to read_csv with tab delimiter
tips = pd.read_table("tips.csv", header=None)

Excel files#

Excel opens various Excel file formats by double-clicking them, or using the Open menu. In pandas, you use special methods for reading and writing from/to Excel files.

Let’s first create a new Excel file based on the tips dataframe in the above example:

tips.to_excel("./tips.xlsx")

Should you wish to subsequently access the data in the tips.xlsx file, you can read it into your module using

tips_df = pd.read_excel("./tips.xlsx", index_col=0)

You have just read in an Excel file using pandas!

Limiting output#

Spreadsheet programs will only show one screenful of data at a time and then allow you to scroll, so there isn’t really a need to limit output. In pandas, you’ll need to put a little more thought into controlling how your DataFrames are displayed.

By default, pandas will truncate output of large DataFrames to show the first and last rows. This can be overridden by changing the pandas options, or using DataFrame.head() or DataFrame.tail().

In [8]: tips.head(5)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 tips.head(5)

NameError: name 'tips' is not defined

Exporting data#

By default, desktop spreadsheet software will save to its respective file format (.xlsx, .ods, etc). You can, however, save to other file formats.

pandas can create Excel files, CSV, or a number of other formats.

Data operations#

Operations on columns#

In spreadsheets, formulas are often created in individual cells and then dragged into other cells to compute them for other columns. In pandas, you’re able to do operations on whole columns directly.

pandas provides vectorized operations by specifying the individual Series in the DataFrame. New columns can be assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.

In [9]: tips["total_bill"] = tips["total_bill"] - 2
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 tips["total_bill"] = tips["total_bill"] - 2

NameError: name 'tips' is not defined

In [10]: tips["new_bill"] = tips["total_bill"] / 2
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[10], line 1
----> 1 tips["new_bill"] = tips["total_bill"] / 2

NameError: name 'tips' is not defined

In [11]: tips
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 tips

NameError: name 'tips' is not defined

In [12]: tips = tips.drop("new_bill", axis=1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 tips = tips.drop("new_bill", axis=1)

NameError: name 'tips' is not defined

Note that we aren’t having to tell it to do that subtraction cell-by-cell — pandas handles that for us. See how to create new columns derived from existing columns.

Filtering#

In Excel, filtering is done through a graphical menu.

Screenshot showing filtering of the total_bill column to values greater than 10

DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.

In [13]: tips[tips["total_bill"] > 10]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 tips[tips["total_bill"] > 10]

NameError: name 'tips' is not defined

The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with True.

In [14]: is_dinner = tips["time"] == "Dinner"
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[14], line 1
----> 1 is_dinner = tips["time"] == "Dinner"

NameError: name 'tips' is not defined

In [15]: is_dinner
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[15], line 1
----> 1 is_dinner

NameError: name 'is_dinner' is not defined

In [16]: is_dinner.value_counts()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 is_dinner.value_counts()

NameError: name 'is_dinner' is not defined

In [17]: tips[is_dinner]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[17], line 1
----> 1 tips[is_dinner]

NameError: name 'tips' is not defined

If/then logic#

Let’s say we want to make a bucket column with values of low and high, based on whether the total_bill is less or more than $10.

In spreadsheets, logical comparison can be done with conditional formulas. We’d use a formula of =IF(A2 < 10, "low", "high"), dragged to all cells in a new bucket column.

Screenshot showing the formula from above in a bucket column of the tips spreadsheet

The same operation in pandas can be accomplished using the where method from numpy.

In [18]: tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[18], line 1
----> 1 tips["bucket"] = np.where(tips["total_bill"] < 10, "low", "high")

NameError: name 'tips' is not defined

In [19]: tips
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[19], line 1
----> 1 tips

NameError: name 'tips' is not defined

Date functionality#

This section will refer to “dates”, but timestamps are handled similarly.

We can think of date functionality in two parts: parsing, and output. In spreadsheets, date values are generally parsed automatically, though there is a DATEVALUE function if you need it. In pandas, you need to explicitly convert plain text to datetime objects, either while reading from a CSV or once in a DataFrame.

Once parsed, spreadsheets display the dates in a default format, though the format can be changed. In pandas, you’ll generally want to keep dates as datetime objects while you’re doing calculations with them. Outputting parts of dates (such as the year) is done through date functions in spreadsheets, and datetime properties in pandas.

Given date1 and date2 in columns A and B of a spreadsheet, you might have these formulas:

column

formula

date1_year

=YEAR(A2)

date2_month

=MONTH(B2)

date1_next

=DATE(YEAR(A2),MONTH(A2)+1,1)

months_between

=DATEDIF(A2,B2,"M")

The equivalent pandas operations are shown below.

In [20]: tips["date1"] = pd.Timestamp("2013-01-15")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[20], line 1
----> 1 tips["date1"] = pd.Timestamp("2013-01-15")

NameError: name 'tips' is not defined

In [21]: tips["date2"] = pd.Timestamp("2015-02-15")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 tips["date2"] = pd.Timestamp("2015-02-15")

NameError: name 'tips' is not defined

In [22]: tips["date1_year"] = tips["date1"].dt.year
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[22], line 1
----> 1 tips["date1_year"] = tips["date1"].dt.year

NameError: name 'tips' is not defined

In [23]: tips["date2_month"] = tips["date2"].dt.month
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[23], line 1
----> 1 tips["date2_month"] = tips["date2"].dt.month

NameError: name 'tips' is not defined

In [24]: tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()

NameError: name 'tips' is not defined

In [25]: tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
   ....:     "date1"
   ....: ].dt.to_period("M")
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[25], line 1
----> 1 tips["months_between"] = tips["date2"].dt.to_period("M") - tips[
      2     "date1"
      3 ].dt.to_period("M")

NameError: name 'tips' is not defined

In [26]: tips[
   ....:     ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
   ....: ]
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[26], line 1
----> 1 tips[
      2     ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
      3 ]

NameError: name 'tips' is not defined

See Time series / date functionality for more details.

Selection of columns#

In spreadsheets, you can select columns you want by:

Since spreadsheet columns are typically named in a header row, renaming a column is simply a matter of changing the text in that first cell.

The same operations are expressed in pandas below.

Keep certain columns#

In [27]: tips[["sex", "total_bill", "tip"]]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[27], line 1
----> 1 tips[["sex", "total_bill", "tip"]]

NameError: name 'tips' is not defined

Drop a column#

In [28]: tips.drop("sex", axis=1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[28], line 1
----> 1 tips.drop("sex", axis=1)

NameError: name 'tips' is not defined

Rename a column#

In [29]: tips.rename(columns={"total_bill": "total_bill_2"})
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[29], line 1
----> 1 tips.rename(columns={"total_bill": "total_bill_2"})

NameError: name 'tips' is not defined

Sorting by values#

Sorting in spreadsheets is accomplished via the sort dialog.

Screenshot of dialog from Excel showing sorting by the sex then total_bill columns

pandas has a DataFrame.sort_values() method, which takes a list of columns to sort by.

In [30]: tips = tips.sort_values(["sex", "total_bill"])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[30], line 1
----> 1 tips = tips.sort_values(["sex", "total_bill"])

NameError: name 'tips' is not defined

In [31]: tips
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[31], line 1
----> 1 tips

NameError: name 'tips' is not defined

String processing#

Finding length of string#

In spreadsheets, the number of characters in text can be found with the LEN function. This can be used with the TRIM function to remove extra whitespace.

=LEN(TRIM(A2))

You can find the length of a character string with Series.str.len(). In Python 3, all strings are Unicode strings. len includes trailing blanks. Use len and rstrip to exclude trailing blanks.

In [32]: tips["time"].str.len()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[32], line 1
----> 1 tips["time"].str.len()

NameError: name 'tips' is not defined

In [33]: tips["time"].str.rstrip().str.len()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[33], line 1
----> 1 tips["time"].str.rstrip().str.len()

NameError: name 'tips' is not defined

Note this will still include multiple spaces within the string, so isn’t 100% equivalent.

Finding position of substring#

The FIND spreadsheet function returns the position of a substring, with the first character being 1.

Screenshot of FIND formula being used in Excel

You can find the position of a character in a column of strings with the Series.str.find() method. find searches for the first position of the substring. If the substring is found, the method returns its position. If not found, it returns -1. Keep in mind that Python indexes are zero-based.

In [34]: tips["sex"].str.find("ale")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[34], line 1
----> 1 tips["sex"].str.find("ale")

NameError: name 'tips' is not defined

Extracting substring by position#

Spreadsheets have a MID formula for extracting a substring from a given position. To get the first character:

=MID(A2,1,1)

With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that Python indexes are zero-based.

In [35]: tips["sex"].str[0:1]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[35], line 1
----> 1 tips["sex"].str[0:1]

NameError: name 'tips' is not defined

Extracting nth word#

In Excel, you might use the Text to Columns Wizard for splitting text and retrieving a specific column. (Note it’s possible to do so through a formula as well.)

The simplest way to extract words in pandas is to split the strings by spaces, then reference the word by index. Note there are more powerful approaches should you need them.

In [36]: firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})

In [37]: firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]

In [38]: firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[1]

In [39]: firstlast
Out[39]: 
       String First_Name Last_Name
0  John Smith       John     Smith
1   Jane Cook       Jane      Cook

Changing case#

Spreadsheets provide UPPER, LOWER, and PROPER functions for converting text to upper, lower, and title case, respectively.

The equivalent pandas methods are Series.str.upper(), Series.str.lower(), and Series.str.title().

In [40]: firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})

In [41]: firstlast["upper"] = firstlast["string"].str.upper()

In [42]: firstlast["lower"] = firstlast["string"].str.lower()

In [43]: firstlast["title"] = firstlast["string"].str.title()

In [44]: firstlast
Out[44]: 
       string       upper       lower       title
0  John Smith  JOHN SMITH  john smith  John Smith
1   Jane Cook   JANE COOK   jane cook   Jane Cook

Merging#

The following tables will be used in the merge examples:

In [45]: df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})

In [46]: df1
Out[46]: 
  key     value
0   A  0.469112
1   B -0.282863
2   C -1.509059
3   D -1.135632

In [47]: df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})

In [48]: df2
Out[48]: 
  key     value
0   B  1.212112
1   D -0.173215
2   D  0.119209
3   E -1.044236

In Excel, there are merging of tables can be done through a VLOOKUP.

Screenshot showing a VLOOKUP formula between two tables in Excel, with some values being filled in and others with "#N/A"

pandas DataFrames have a merge() method, which provides similar functionality. The data does not have to be sorted ahead of time, and different join types are accomplished via the how keyword.

In [49]: inner_join = df1.merge(df2, on=["key"], how="inner")

In [50]: inner_join
Out[50]: 
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209

In [51]: left_join = df1.merge(df2, on=["key"], how="left")

In [52]: left_join
Out[52]: 
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209

In [53]: right_join = df1.merge(df2, on=["key"], how="right")

In [54]: right_join
Out[54]: 
  key   value_x   value_y
0   B -0.282863  1.212112
1   D -1.135632 -0.173215
2   D -1.135632  0.119209
3   E       NaN -1.044236

In [55]: outer_join = df1.merge(df2, on=["key"], how="outer")

In [56]: outer_join
Out[56]: 
  key   value_x   value_y
0   A  0.469112       NaN
1   B -0.282863  1.212112
2   C -1.509059       NaN
3   D -1.135632 -0.173215
4   D -1.135632  0.119209
5   E       NaN -1.044236

merge has a number of advantages over VLOOKUP:

  • The lookup value doesn’t need to be the first column of the lookup table

  • If multiple rows are matched, there will be one row for each match, instead of just the first

  • It will include all columns from the lookup table, instead of just a single specified column

  • It supports more complex join operations

Other considerations#

Fill Handle#

Create a series of numbers following a set pattern in a certain set of cells. In a spreadsheet, this would be done by shift+drag after entering the first number or by entering the first two or three values and then dragging.

This can be achieved by creating a series and assigning it to the desired cells.

In [57]: df = pd.DataFrame({"AAA": [1] * 8, "BBB": list(range(0, 8))})

In [58]: df
Out[58]: 
   AAA  BBB
0    1    0
1    1    1
2    1    2
3    1    3
4    1    4
5    1    5
6    1    6
7    1    7

In [59]: series = list(range(1, 5))

In [60]: series
Out[60]: [1, 2, 3, 4]

In [61]: df.loc[2:5, "AAA"] = series

In [62]: df
Out[62]: 
   AAA  BBB
0    1    0
1    1    1
2    1    2
3    2    3
4    3    4
5    4    5
6    1    6
7    1    7

Drop Duplicates#

Excel has built-in functionality for removing duplicate values. This is supported in pandas via drop_duplicates().

In [63]: df = pd.DataFrame(
   ....:     {
   ....:         "class": ["A", "A", "A", "B", "C", "D"],
   ....:         "student_count": [42, 35, 42, 50, 47, 45],
   ....:         "all_pass": ["Yes", "Yes", "Yes", "No", "No", "Yes"],
   ....:     }
   ....: )
   ....: 

In [64]: df.drop_duplicates()
Out[64]: 
  class  student_count all_pass
0     A             42      Yes
1     A             35      Yes
3     B             50       No
4     C             47       No
5     D             45      Yes

In [65]: df.drop_duplicates(["class", "student_count"])
Out[65]: 
  class  student_count all_pass
0     A             42      Yes
1     A             35      Yes
3     B             50       No
4     C             47       No
5     D             45      Yes

Pivot Tables#

PivotTables from spreadsheets can be replicated in pandas through Reshaping and pivot tables. Using the tips dataset again, let’s find the average gratuity by size of the party and sex of the server.

In Excel, we use the following configuration for the PivotTable:

Screenshot showing a PivotTable in Excel, using sex as the column, size as the rows, then average tip as the values

The equivalent in pandas:

In [66]: pd.pivot_table(
   ....:     tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average
   ....: )
   ....: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[66], line 2
      1 pd.pivot_table(
----> 2     tips, values="tip", index=["size"], columns=["sex"], aggfunc=np.average
      3 )

NameError: name 'tips' is not defined

Adding a row#

Assuming we are using a RangeIndex (numbered 0, 1, etc.), we can use concat() to add a row to the bottom of a DataFrame.

In [67]: df
Out[67]: 
  class  student_count all_pass
0     A             42      Yes
1     A             35      Yes
2     A             42      Yes
3     B             50       No
4     C             47       No
5     D             45      Yes

In [68]: new_row = pd.DataFrame([["E", 51, True]],
   ....:                        columns=["class", "student_count", "all_pass"])
   ....: 

In [69]: pd.concat([df, new_row])
Out[69]: 
  class  student_count all_pass
0     A             42      Yes
1     A             35      Yes
2     A             42      Yes
3     B             50       No
4     C             47       No
5     D             45      Yes
0     E             51     True

Find and Replace#

Excel’s Find dialog takes you to cells that match, one by one. In pandas, this operation is generally done for an entire column or DataFrame at once through conditional expressions.

In [70]: tips
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[70], line 1
----> 1 tips

NameError: name 'tips' is not defined

In [71]: tips == "Sun"
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[71], line 1
----> 1 tips == "Sun"

NameError: name 'tips' is not defined

In [72]: tips["day"].str.contains("S")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[72], line 1
----> 1 tips["day"].str.contains("S")

NameError: name 'tips' is not defined

pandas’ replace() is comparable to Excel’s Replace All.

In [73]: tips.replace("Thu", "Thursday")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[73], line 1
----> 1 tips.replace("Thu", "Thursday")

NameError: name 'tips' is not defined