Quick Start
===========
Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::
 html_doc = """<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """
Running the "three sisters" document through Beautiful Soup gives us a
``BeautifulSoup`` object, which represents the document as a nested
data structure::
 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc, 'html.parser')

 print(soup.prettify())
 # <html>
 #  <head>
 #   <title>
 #    The Dormouse's story
 #   </title>
 #  </head>
 #  <body>
 #   <p class="title">
 #    <b>
 #     The Dormouse's story
 #    </b>
 #   </p>
 #   <p class="story">
 #    Once upon a time there were three little sisters; and their names were
 #    <a class="sister" href="http://example.com/elsie" id="link1">
 #     Elsie
 #    </a>
 #    ,
 #    <a class="sister" href="http://example.com/lacie" id="link2">
 #     Lacie
 #    </a>
 #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie
 #    </a>
 #    ; and they lived at the bottom of a well.
 #   </p>
 #   <p class="story">
 #    ...
 #   </p>
 #  </body>
 # </html>
Here are some simple ways to navigate that data structure::
 soup.title
 # <title>The Dormouse's story</title>

 soup.title.name
 # 'title'

 soup.title.string
 # 'The Dormouse's story'

 soup.title.parent.name
 # 'head'

 soup.p
 # <p class="title"><b>The Dormouse's story</b></p>

 soup.p['class']
 # ['title']

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find(id="link3")
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
One common task is extracting all the URLs found within a page's ``<a>`` tags::
 for link in soup.find_all('a'):
     print(link.get('href'))
 # http://example.com/elsie
 # http://example.com/lacie
 # http://example.com/tillie
Another common task is extracting all the text from a page::
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
Does this look like what you need? If so, read on.
Installing Beautiful Soup
=========================
If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:
:kbd:`$ apt-get install python3-bs4`
Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``. Make sure you use the
right version of ``pip`` or ``easy_install`` for your Python version
(these may be named ``pip3`` and ``easy_install3`` respectively).
:kbd:`$ easy_install beautifulsoup4`
:kbd:`$ pip install beautifulsoup4`
(The ``BeautifulSoup`` package is `not` what you want. That's
the previous major release, `Beautiful Soup 3`_. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)
If you don't have ``easy_install`` or ``pip`` installed, you can
`download the Beautiful Soup 4 source tarball
<https://www.crummy.com/software/BeautifulSoup/download/4.x/>`_ and
install it with ``setup.py``.

:kbd:`$ python setup.py install`
If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application. You can download the
tarball, copy its ``bs4`` directory into your application's codebase,
and use Beautiful Soup without installing it at all.
I use Python 3.8 to develop Beautiful Soup, but it should work with
other recent versions.
.. _parser-installation:
Installing a parser
-------------------
Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the `lxml parser <https://lxml.de/>`_. Depending on your setup,
you might install lxml with one of these commands:
:kbd:`$ apt-get install python-lxml`
:kbd:`$ easy_install lxml`
:kbd:`$ pip install lxml`
Another alternative is the pure-Python `html5lib parser
<https://github.com/html5lib/html5lib-python>`_, which parses HTML the way a
web browser does. Depending on your setup, you might install html5lib
with one of these commands:
:kbd:`$ apt-get install python-html5lib`
:kbd:`$ easy_install html5lib`
:kbd:`$ pip install html5lib`
This table summarizes the advantages and disadvantages of each parser library:
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser | Typical usage | Advantages | Disadvantages |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")`` | * Batteries included | * Not as fast as lxml, |
| | | * Decent speed | less lenient than |
| | | * Lenient (As of Python 3.2) | html5lib. |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser | ``BeautifulSoup(markup, "lxml")`` | * Very fast | * External C dependency |
| | | * Lenient | |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser | ``BeautifulSoup(markup, "lxml-xml")`` | * Very fast | * External C dependency |
| | ``BeautifulSoup(markup, "xml")`` | * The only currently supported | |
| | | XML parser | |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib | ``BeautifulSoup(markup, "html5lib")`` | * Extremely lenient | * Very slow |
| | | * Parses pages the same way a | * External Python |
| | | web browser does | dependency |
| | | * Creates valid HTML5 | |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
If you can, I recommend you install and use lxml for speed. If you're
using a very old version of Python -- earlier than 3.2.2 -- it's
`essential` that you install lxml or html5lib. Python's built-in HTML
parser is just not very good in those old versions.
Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.
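For a quick taste of those differences, here's an invalid fragment run
through Python's built-in parser; the commented html5lib result shows
what that parser would produce from the same input, if you have it
installed:

```python
from bs4 import BeautifulSoup

# html.parser silently drops the dangling </p> tag.
print(BeautifulSoup("<a></p>", "html.parser"))
# <a></a>

# html5lib, if installed, would instead build a complete document:
# BeautifulSoup("<a></p>", "html5lib") gives
# <html><head></head><body><a><p></p></a></body></html>
```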
Making the soup
===============
To parse a document, pass it into the ``BeautifulSoup``
constructor. You can pass in a string or an open filehandle::
 from bs4 import BeautifulSoup

 with open("index.html") as fp:
     soup = BeautifulSoup(fp, 'html.parser')

 soup = BeautifulSoup("<html>a web page</html>", 'html.parser')
First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::
 print(BeautifulSoup("Sacr&eacute; bleu!", "html.parser"))
 # Sacré bleu!
Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)
Kinds of objects
================
Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects: ``Tag``, ``NavigableString``, ``BeautifulSoup``,
and ``Comment``.
.. _Tag:
``Tag``
-------
A ``Tag`` object corresponds to an XML or HTML tag in the original document::
 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>
Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.
Name
^^^^
Every tag has a name, accessible as ``.name``::
tag.name
# 'b'
If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::
 tag.name = "blockquote"
 tag
 # <blockquote class="boldest">Extremely bold</blockquote>
Attributes
^^^^^^^^^^
A tag may have any number of attributes. The tag ``<b id="boldest">``
has an attribute "id" whose value is "boldest". You can access a tag's
attributes by treating the tag like a dictionary::

 tag = BeautifulSoup('<b id="boldest">bold</b>', 'html.parser').b
 tag['id']
 # 'boldest'
You can access that dictionary directly as ``.attrs``::
tag.attrs
# {'id': 'boldest'}
You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::
 tag['id'] = 'verybold'
 tag['another-attribute'] = 1
 tag
 # <b another-attribute="1" id="verybold">bold</b>

 del tag['id']
 del tag['another-attribute']
 tag
 # <b>bold</b>

 tag['id']
 # KeyError: 'id'

 tag.get('id')
 # None
tag.get('id')
# None
.. _multivalue:
Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&
HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. By default, Beautiful Soup parses the value(s)
of a multi-valued attribute into a list::
 css_soup = BeautifulSoup('<p class="body"></p>', 'html.parser')
 css_soup.p['class']
 # ['body']

 css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
 css_soup.p['class']
 # ['body', 'strikeout']
If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::
 id_soup = BeautifulSoup('<p id="my id"></p>', 'html.parser')
 id_soup.p['id']
 # 'my id'
When you turn a tag back into a string, multiple attribute values are
consolidated::
 rel_soup = BeautifulSoup('<p>Back to the <a rel="index first">homepage</a></p>', 'html.parser')
 rel_soup.a['rel']
 # ['index', 'first']

 rel_soup.a['rel'] = ['index', 'contents']
 print(rel_soup.p)
 # <p>Back to the <a rel="index contents">homepage</a></p>
You can force all attributes to be parsed as strings by passing
``multi_valued_attributes=None`` as a keyword argument into the
``BeautifulSoup`` constructor::
 no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser', multi_valued_attributes=None)
 no_list_soup.p['class']
 # 'body strikeout'
You can use ``get_attribute_list`` to get a value that's always a
list, whether or not it's a multi-valued attribute::
id_soup.p.get_attribute_list('id')
# ["my id"]
If you parse a document as XML, there are no multi-valued attributes::
 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
 xml_soup.p['class']
 # 'body strikeout'
Again, you can configure this using the ``multi_valued_attributes`` argument::
 class_is_multi = { '*' : 'class'}
 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
 xml_soup.p['class']
 # ['body', 'strikeout']
You probably won't need to do this, but if you do, use the defaults as
a guide. They implement the rules described in the HTML specification::
from bs4.builder import builder_registry
builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES
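As a sketch of how you might build on those defaults, the example below
copies them and registers one extra multi-valued attribute;
``data-tags`` is a made-up attribute used purely for illustration:

```python
from bs4 import BeautifulSoup
from bs4.builder import builder_registry

# Copy the HTML defaults, then treat a hypothetical "data-tags"
# attribute on <p> tags as multi-valued too.
multi = dict(builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES)
multi['p'] = ['data-tags']

soup = BeautifulSoup('<p data-tags="a b"></p>', 'html.parser',
                     multi_valued_attributes=multi)
soup.p['data-tags']
# ['a', 'b']
```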
``NavigableString``
-------------------
A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::
 soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
 tag = soup.b
 tag.string
 # 'Extremely bold'

 type(tag.string)
 # <class 'bs4.element.NavigableString'>
A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
``NavigableString`` to a Unicode string with ``str``::
 unicode_string = str(tag.string)
 unicode_string
 # 'Extremely bold'

 type(unicode_string)
 # <class 'str'>
You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with()`::
 tag.string.replace_with("No longer bold")
 tag
 # <b class="boldest">No longer bold</b>
``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.
If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``str()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.
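A minimal sketch of the difference: the ``NavigableString`` still knows
its place in the tree, while the converted ``str`` does not:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b>Extremely bold</b>', 'html.parser')
s = soup.b.string      # NavigableString; keeps a reference to the tree
s.parent.name
# 'b'

plain = str(s)         # an ordinary str; no reference to the parse tree
type(plain)
# <class 'str'>
```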
``BeautifulSoup``
-----------------
The ``BeautifulSoup`` object represents the parsed document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.
You can also pass a ``BeautifulSoup`` object into one of the methods
defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This
lets you do things like combine two parsed documents::
 doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
 footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
 doc.find(text="INSERT FOOTER HERE").replace_with(footer)
 # 'INSERT FOOTER HERE'
 print(doc)
 # <?xml version="1.0" encoding="utf-8"?>
 # <document><content/><footer>Here's the footer</footer></document>
Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::
soup.name
# '[document]'
Comments and other special strings
----------------------------------
``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The main one you'll probably encounter
is the comment::
 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
 soup = BeautifulSoup(markup, 'html.parser')
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>
The ``Comment`` object is just a special type of ``NavigableString``::
 comment
 # 'Hey, buddy. Want to buy a used parser?'
But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::
 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>
Beautiful Soup also defines classes called ``Stylesheet``, ``Script``,
and ``TemplateString``, for embedded CSS stylesheets (any strings
found inside a ``