Sanitizing text fragments
Bleach sanitizes text fragments for use in an HTML context. It provides a bleach.clean() function and a more configurable bleach.sanitizer.Cleaner class with safe defaults.
Given a text fragment, Bleach will parse it according to the HTML5 parsing algorithm and sanitize tags, attributes, and other aspects. This also handles unescaped characters and unclosed and misnested tags. The result is text that can be used in HTML as is.
Warning
bleach.clean() is for sanitizing HTML fragments for use in an HTML context, not for use in HTML attributes, CSS, JavaScript, JavaScript templates (mustache, handlebars, angular, jsx, etc.), JSON, XHTML, SVG, or other contexts.
For example, this is a safe use of clean output in an HTML context:
<p>
{{ bleach.clean(user_bio) }}
</p>
This is not a safe use of clean output in an HTML attribute:
<body data-bio="{{ bleach.clean(user_bio) }}">
If you need to use the output of bleach.clean() in any other context, you need to pass it through an appropriate sanitizer/escaper for that context. For example, if you wanted to use the output in an HTML attribute value, you would need to pass it through Jinja’s or Django’s escape function.
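The attribute case above can be sketched with the standard library alone. This is a minimal illustration, assuming that html.escape with quote=True is an acceptable escaper for a double-quoted attribute value; cleaned_bio is a hypothetical stand-in for the return value of bleach.clean(user_bio), and in a Jinja or Django template you would use that framework's own escape function instead.

```python
import html

# Hypothetical stand-in for the return value of bleach.clean(user_bio).
cleaned_bio = '<b>"bold"</b> bio'

# quote=True also escapes " and ', which a quoted attribute value requires.
attr_value = html.escape(cleaned_bio, quote=True)

# The escaped value can now sit inside a double-quoted attribute.
tag = '<body data-bio="{}">'.format(attr_value)
print(tag)
```

The cleaned markup is escaped a second time on purpose: inside an attribute it is data, not markup, so its angle brackets and quotes must not reach the HTML parser unescaped.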
- bleach.clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, protocols=ALLOWED_PROTOCOLS, strip=False, strip_comments=True, css_sanitizer=None)
Clean an HTML fragment of malicious content and return it.
This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page.
This function is not designed to transform content for use in non-web-page contexts.
Example:
import bleach
better_text = bleach.clean(yucky_text)
Note
If you’re cleaning a lot of text and passing the same argument values, or you want more configurability, consider using a bleach.sanitizer.Cleaner instance.
- Parameters
text (str) – the text to clean
tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
strip (bool) – whether or not to strip disallowed elements
strip_comments (bool) – whether or not to strip HTML comments
css_sanitizer (CSSSanitizer) – instance with a “sanitize_css” method for sanitizing style attribute values and style text; defaults to None
- Returns
cleaned text as unicode
Allowed Attributes (attributes)
The attributes kwarg lets you specify which attributes are allowed. The value can be a list, a callable, or a map of tag name to list or callable.
The default value is a conservative dict found in bleach.sanitizer.ALLOWED_ATTRIBUTES.
- bleach.sanitizer.ALLOWED_ATTRIBUTES = {'a': ['href', 'title'], 'abbr': ['title'], 'acronym': ['title']}
Map of allowed attributes by tag
Changed in version 2.0: Prior to 2.0, the attributes kwarg value could only be a list or a map.
As a list
The attributes value can be a list which specifies the list of attributes allowed for any tag.
For example:
>>> import bleach
>>> bleach.clean(
... '<p class="foo" style="color: red; font-weight: bold;">blah blah blah</p>',
... tags=['p'],
... attributes=['class'],
... )
'<p class="foo">blah blah blah</p>'
As a dict
The attributes value can be a dict which maps tags to what attributes they can have.
You can also specify *, which will match any tag.
For example, this allows “href” and “rel” for “a” tags, “alt” for the “img” tag and “class” for any tag (including “a” and “img”):
>>> import bleach
>>> attrs = {
... '*': ['class'],
... 'a': ['href', 'rel'],
... 'img': ['alt'],
... }
>>> bleach.clean(
... '<img alt="an example" width=500>',
... tags=['img'],
... attributes=attrs
... )
'<img alt="an example">'
Using functions
You can also use callables that take the tag, attribute name and attribute value and return True to keep the attribute or False to drop it.
You can pass a callable as the attributes argument value and it’ll run for every tag/attr.
For example:
>>> import bleach
>>> def allow_h(tag, name, value):
...     return name[0] == 'h'
>>> bleach.clean(
...     '<a href="http://example.com" title="link">link</a>',
...     tags=['a'],
...     attributes=allow_h,
... )
'<a href="http://example.com">link</a>'
You can also pass a callable as a value in an attributes dict and it’ll run for attributes for specified tags:
>>> from urllib.parse import urlparse
>>> import bleach
>>> def allow_src(tag, name, value):
...     if name in ('alt', 'height', 'width'):
...         return True
...     if name == 'src':
...         p = urlparse(value)
...         return (not p.netloc) or p.netloc == 'mydomain.com'
...     return False
>>> bleach.clean(
...     '<img src="http://example.com" alt="an example">',
...     tags=['img'],
...     attributes={
...         'img': allow_src
...     }
... )
'<img alt="an example">'
Changed in version 2.0: In previous versions of Bleach, the callable took an attribute name and an attribute value. Now it takes a tag, an attribute name and an attribute value.
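Because an attribute callable is an ordinary function of (tag, name, value), it can be written and tested without running Bleach at all. The sketch below is one possible callable, not part of Bleach: ALLOWED_CLASSES and allow_known_classes are hypothetical names, and the policy (keep a class attribute only when every class name is on an allow list) is an assumption chosen for illustration.

```python
# Hypothetical allow list of CSS class names for this sketch.
ALLOWED_CLASSES = frozenset({'note', 'warning'})


def allow_known_classes(tag, name, value):
    """Return True only for class attributes whose names are all allowed."""
    if name != 'class':
        return False
    classes = value.split()
    # An empty class attribute carries no information, so drop it too.
    return bool(classes) and all(c in ALLOWED_CLASSES for c in classes)


print(allow_known_classes('p', 'class', 'note'))        # True
print(allow_known_classes('p', 'class', 'note evil'))   # False
print(allow_known_classes('p', 'style', 'color: red'))  # False
```

A function with this signature could then be passed as attributes=allow_known_classes, or as a value in an attributes dict such as {'p': allow_known_classes}.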
Allowed protocols (protocols)
If you allow tags that have attributes containing a URI value (like the href attribute of an anchor tag), you may want to adapt the accepted protocols.
For example, this sets allowed protocols to http, https and smb:
>>> import bleach
>>> bleach.clean(
... '<a href="smb://more_text">allowed protocol</a>',
... protocols=['http', 'https', 'smb']
... )
'<a href="smb://more_text">allowed protocol</a>'
This adds smb to the Bleach-specified set of allowed protocols:
>>> import bleach
>>> bleach.clean(
... '<a href="smb://more_text">allowed protocol</a>',
... protocols=bleach.ALLOWED_PROTOCOLS + ['smb']
... )
'<a href="smb://more_text">allowed protocol</a>'
Default protocols are in bleach.sanitizer.ALLOWED_PROTOCOLS.
- bleach.sanitizer.ALLOWED_PROTOCOLS = ['http', 'https', 'mailto']
List of allowed protocols
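When extending the defaults, it can help to build a new collection rather than modify the shared one. The following is a small sketch under stated assumptions: DEFAULT_PROTOCOLS copies the value shown above so that the example runs on its own (real code would import it from bleach.sanitizer), and with_extra_protocols is a hypothetical helper, not a Bleach function.

```python
# Copied from the default shown in this document, for illustration only.
DEFAULT_PROTOCOLS = ['http', 'https', 'mailto']


def with_extra_protocols(defaults, extra):
    """Return a new sorted list holding the defaults plus the extras."""
    # Going through set() removes duplicates and leaves the inputs untouched.
    return sorted(set(defaults) | set(extra))


protocols = with_extra_protocols(DEFAULT_PROTOCOLS, ['smb'])
print(protocols)  # ['http', 'https', 'mailto', 'smb']
```

The result can be passed as the protocols argument. Building it through a set means the helper does not depend on the container type of the defaults.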
Stripping markup (strip)
By default, Bleach escapes tags that aren’t specified in the allowed tags list and invalid markup. For example:
>>> import bleach
>>> bleach.clean('<span>is not allowed</span>')
'&lt;span&gt;is not allowed&lt;/span&gt;'
>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'])
'<b>&lt;span&gt;is not allowed&lt;/span&gt;</b>'
If you would rather Bleach stripped this markup entirely, you can pass strip=True:
>>> import bleach
>>> bleach.clean('<span>is not allowed</span>', strip=True)
'is not allowed'
>>> bleach.clean('<b><span>is not allowed</span></b>', tags=['b'], strip=True)
'<b>is not allowed</b>'
Stripping comments (strip_comments)
By default, Bleach will strip out HTML comments. To disable this behavior, set strip_comments=False:
>>> import bleach
>>> html = 'my<!-- commented --> html'
>>> bleach.clean(html)
'my html'
>>> bleach.clean(html, strip_comments=False)
'my<!-- commented --> html'
Sanitizing CSS
Bleach can sanitize CSS in style attribute values. In order to use this feature, you have to install additional dependencies:
pip install 'bleach[css]'
Bleach provides a bleach.css_sanitizer.CSSSanitizer class that has a sanitize_css method. This takes a style attribute value as text and returns a sanitized version of that value.
For example:
>>> import bleach
>>> from bleach.css_sanitizer import CSSSanitizer
>>> css_sanitizer = CSSSanitizer(allowed_css_properties=["color", "font-weight"])
>>> tags = ['p', 'em', 'strong']
>>> attrs = {
... '*': ['style']
... }
>>> bleach.clean(
... '<p style="font-weight: heavy;">my html</p>',
... tags=tags,
... attributes=attrs,
... css_sanitizer=css_sanitizer
... )
'<p style="font-weight: heavy;">my html</p>'
Defaults are stored in bleach.css_sanitizer.ALLOWED_CSS_PROPERTIES and bleach.css_sanitizer.ALLOWED_SVG_PROPERTIES.
Note
This silently drops ParseError and AtRule tokens in CSS parsing. If you need to sanitize style values that have @media or need to do something with CSS parse errors, you should implement your own bleach.css_sanitizer.CSSSanitizer.
New in version 5.0.
Using bleach.sanitizer.Cleaner
If you’re cleaning a lot of text or you need better control of things, you should create a bleach.sanitizer.Cleaner instance.
- class bleach.sanitizer.Cleaner(tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, protocols=ALLOWED_PROTOCOLS, strip=False, strip_comments=True, filters=None, css_sanitizer=None)
Cleaner for cleaning HTML fragments of malicious content
This cleaner is a security-focused class whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page.
To use:
from bleach.sanitizer import Cleaner
cleaner = Cleaner()
for text in all_the_yucky_things:
    sanitized = cleaner.clean(text)
Note
This cleaner is not designed to transform content for use in non-web-page contexts.
Warning
This cleaner is not thread-safe because the HTML parser has internal state. Create a separate cleaner per thread!
Initializes a Cleaner
- Parameters
tags (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
strip (bool) – whether or not to strip disallowed elements
strip_comments (bool) – whether or not to strip HTML comments
filters (list) – list of html5lib Filter classes to pass streamed content through
Warning
Using filters changes the output of bleach.Cleaner.clean. Make sure the way the filters change the output is secure.
css_sanitizer (CSSSanitizer) – instance with a “sanitize_css” method for sanitizing style attribute values and style text; defaults to None
New in version 2.0.
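One way to follow the thread-safety warning above is to keep one cleaner per thread in thread-local storage. This is a sketch, not something Bleach provides: get_cleaner is a hypothetical helper, and it takes a zero-argument factory so the example runs without Bleach installed. In real use the factory would be something like lambda: Cleaner(tags=[...]).

```python
import threading

# Each thread sees its own attributes on this object.
_local = threading.local()


def get_cleaner(factory):
    """Return this thread's cleaner, creating it on first use."""
    cleaner = getattr(_local, 'cleaner', None)
    if cleaner is None:
        cleaner = factory()
        _local.cleaner = cleaner
    return cleaner


# Demonstration with a plain object standing in for a Cleaner instance.
main_cleaner = get_cleaner(object)
seen_in_worker = []
worker = threading.Thread(
    target=lambda: seen_in_worker.append(get_cleaner(object))
)
worker.start()
worker.join()
print(main_cleaner is get_cleaner(object))    # True: reused within a thread
print(main_cleaner is seen_in_worker[0])      # False: the worker got its own
```

The cost of constructing a cleaner is paid once per thread rather than once per call, and no instance is ever shared between threads.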
html5lib Filters (filters)
Bleach sanitizing is implemented as an html5lib filter. The consequence of this is that we can pass the streamed content through additional specified filters after the bleach.sanitizer.BleachSanitizerFilter filter has run.
This lets you add data, drop data and change data as it is being serialized back to a unicode string.
Documentation on html5lib Filters is here: http://html5lib.readthedocs.io/en/latest/movingparts.html#filters
Trivial Filter example:
>>> from bleach.sanitizer import Cleaner
>>> from bleach.html5lib_shim import Filter
>>> class MooFilter(Filter):
...     def __iter__(self):
...         for token in Filter.__iter__(self):
...             if token['type'] in ['StartTag', 'EmptyTag'] and token['data']:
...                 for attr, value in token['data'].items():
...                     token['data'][attr] = 'moo'
...             yield token
...
>>> ATTRS = {
...     'img': ['rel', 'src']
... }
...
>>> TAGS = ['img']
>>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter])
>>> dirty = 'this is cute! <img src="http://example.com/puppy.jpg" rel="nofollow">'
>>> cleaner.clean(dirty)
'this is cute! <img src="moo" rel="moo">'
Warning
Filters change the output of cleaning. Make sure that whatever changes the filter is applying maintain the safety guarantees of the output.
New in version 2.0.
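A filter that only removes data is the easiest kind to reason about with respect to the warning above. The sketch below is a hypothetical DropTitleFilter that deletes title attributes. It assumes attribute keys in the token stream are (namespace, name) tuples, and it falls back to a minimal stand-in base class so that it runs without Bleach installed; the token list is hand-built for illustration and is not real parser output.

```python
try:
    from bleach.html5lib_shim import Filter
except ImportError:
    # Minimal stand-in so this sketch runs without Bleach installed; it
    # mirrors only the parts of the base filter that are used below.
    class Filter:
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            return iter(self.source)


class DropTitleFilter(Filter):
    """Hypothetical filter that removes title attributes from every tag."""

    def __iter__(self):
        for token in Filter.__iter__(self):
            if token['type'] in ('StartTag', 'EmptyTag') and token['data']:
                # Attribute keys are assumed to be (namespace, name) tuples.
                token['data'] = {
                    key: value
                    for key, value in token['data'].items()
                    if key[1] != 'title'
                }
            yield token


# Hand-built token stream, for illustration only.
tokens = [{
    'type': 'StartTag',
    'name': 'a',
    'data': {(None, 'href'): 'http://example.com', (None, 'title'): 'hi'},
}]
filtered = list(DropTitleFilter(tokens))
print(filtered[0]['data'])
```

Such a class would be passed in the same way as MooFilter, through the filters argument of Cleaner.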
Using bleach.sanitizer.BleachSanitizerFilter
bleach.clean creates a bleach.sanitizer.Cleaner which creates a bleach.sanitizer.BleachSanitizerFilter which does the sanitizing work.
BleachSanitizerFilter is an html5lib filter and can be used anywhere you can use an html5lib filter.
- class bleach.sanitizer.BleachSanitizerFilter(source, allowed_elements=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, allowed_protocols=ALLOWED_PROTOCOLS, strip_disallowed_elements=False, strip_html_comments=True, css_sanitizer=None, **kwargs)
html5lib Filter that sanitizes text
This filter can be used anywhere html5lib filters can be used.
Creates a BleachSanitizerFilter instance
- Parameters
source – html5lib TreeWalker stream
allowed_elements (list) – allowed list of tags; defaults to bleach.sanitizer.ALLOWED_TAGS
attributes (dict) – allowed attributes; can be a callable, list or dict; defaults to bleach.sanitizer.ALLOWED_ATTRIBUTES
allowed_protocols (list) – allowed list of protocols for links; defaults to bleach.sanitizer.ALLOWED_PROTOCOLS
strip_disallowed_elements (bool) – whether or not to strip disallowed elements
strip_html_comments (bool) – whether or not to strip HTML comments
css_sanitizer (CSSSanitizer) – instance with a “sanitize_css” method for sanitizing style attribute values and style text; defaults to None