<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dodgeball Cannon &#187; html5lib</title>
	<atom:link href="http://www.johntantalo.com/blog/tag/html5lib/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.johntantalo.com/blog</link>
	<description>It's not so much a time machine as it is my blog.</description>
	<lastBuildDate>Sun, 22 Jan 2012 14:10:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Strip tags with html5lib</title>
		<link>http://www.johntantalo.com/blog/strip-tags-with-html5lib/</link>
		<comments>http://www.johntantalo.com/blog/strip-tags-with-html5lib/#comments</comments>
		<pubDate>Tue, 15 Dec 2009 04:14:37 +0000</pubDate>
		<dc:creator>John Tantalo</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[html5lib]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.johntantalo.com/blog/?p=82</guid>
		<description><![CDATA[There are a couple of posts out there that discuss stripping tags with html5lib, but they seem intent on preserving the &#8220;acceptable elements&#8221; such as &#60;span&#62; and &#60;code&#62;. This is fine unless you really want to friggin&#8217; strip out the tags, as I needed to for Emend. The following is my solution. Source code for stripping tags [...]]]></description>
			<content:encoded><![CDATA[<p>There are a couple of posts <a href="http://code.google.com/p/html5lib/issues/detail?id=62">out</a> <a href="http://deathofagremmie.com/2009/04/12/using-html5lib-to-sanitize-user-input/">there</a> that discuss stripping tags with <a href="http://code.google.com/p/html5lib/"><em>html5lib</em></a>, but they seem intent on preserving the &#8220;acceptable elements&#8221; such as <code>&lt;span&gt;</code> and <code>&lt;code&gt;</code>.</p>
<p>This is fine unless you really want to <em>friggin&#8217; strip out the tags</em>, as I needed to for <a href="http://emendapp.com">Emend</a>. The following is my solution.</p>
<p><script src="http://gist.github.com/256684.js?file=strip_tags.py"></script></p>
<p><a href="http://gist.github.com/256684">Source code for stripping tags with html5lib and unit test.</a></p>
<p>For example,</p>
<pre><code>&gt;&gt;&gt; from strip_tags import strip_tags
&gt;&gt;&gt; strip_tags('&lt;p&gt;foo&lt;/p&gt; &lt;script&gt;bar&lt;/script&gt;')
u'foo bar'</code></pre>
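<p>For a rough idea of how a function like this can work, here is a minimal sketch. It is not the code from the gist: it assumes a current <em>html5lib</em> release with its default etree tree builder, and it reuses the name <code>strip_tags</code> purely for illustration. It parses the input as a fragment and joins every text node it finds.</p>

```python
import html5lib


def strip_tags(html):
    """Return only the text content of an HTML fragment, with all markup removed."""
    # parseFragment builds an xml.etree element tree by default; html5lib
    # applies the HTML5 parsing rules, so malformed markup is still handled.
    fragment = html5lib.parseFragment(html)
    # itertext() walks the tree in document order and yields every text node,
    # including the contents of script and style elements.
    return "".join(fragment.itertext())
```

<p>Like the example above, this sketch keeps the text inside <code>&lt;script&gt;</code> elements rather than dropping it. If you would rather discard script or style contents, you would need to skip those elements while walking the tree. On Python 3 the result is a plain <code>str</code>, so it prints without the <code>u</code> prefix.</p>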
<p>Thanks go to <a href="http://edward.oconnor.cx/">Edward O’Connor</a> for pointing me towards <em>html5lib</em> in the first place. It&#8217;s a huge improvement over <a href="http://docs.python.org/library/htmlparser.html"><em>HTMLParser</em></a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.johntantalo.com/blog/strip-tags-with-html5lib/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

