There are a couple of posts out there that discuss stripping tags with html5lib, but they seem intent on preserving the “acceptable elements” such as <span> and <code>.

That's fine unless you really want to friggin’ strip out all of the tags, as I needed to for Emend. The following is my solution.

Source code for stripping tags with html5lib, along with a unit test.

For example,

>>> from strip_tags import strip_tags
>>> strip_tags('<p>foo</p> <script>bar</script>')
u'foo bar'
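The linked source isn't reproduced here, so what follows is only a minimal sketch of one way a strip_tags function like this could be written, not necessarily the original implementation. It assumes html5lib's default etree tree builder, so that html5lib.parseFragment returns a standard ElementTree element whose text can be collected with itertext().

```python
import html5lib


def strip_tags(html):
    """Return only the text content of an HTML fragment, dropping every tag.

    Sketch only: relies on html5lib's default "etree" tree builder, which
    hands back an xml.etree.ElementTree element for the parsed fragment.
    """
    # namespaceHTMLElements=False keeps tag names plain ("p", not
    # "{http://www.w3.org/1999/xhtml}p"); it doesn't affect the text itself.
    fragment = html5lib.parseFragment(html, namespaceHTMLElements=False)
    # itertext() walks the tree in document order and yields each element's
    # text and tail, so joining the pieces gives the text with no markup.
    return "".join(fragment.itertext())


if __name__ == "__main__":
    print(strip_tags('<p>foo</p> <script>bar</script>'))
```

Note that this keeps the contents of <script> elements as text, which matches the example above. If you'd rather drop script bodies entirely, you'd need to remove those elements from the tree before collecting the text.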

Thanks go to Edward O’Connor for pointing me towards html5lib in the first place. It’s a huge improvement over HTMLParser.
