I've managed to rewrite my blog again. This time to appengine using web.py.
I started with the demo app Aaron put together the nite appengine was released, and with some pointers from Kragen I was quickly 80% done with the new site. The next 15% involved figuring out how to get reCaptcha and HTML sanitization/cleanup working. Once that was done a few DNS changes and the new site is live.
To get reCaptcha integrated, I started with recaptcha-client 1.0.1, and modified it to use appengine's urlfetch:
def submit(recaptcha_challenge_field,
recaptcha_response_field,
private_key,
remoteip):
"""
Submits a reCAPTCHA request for verification. Returns RecaptchaResponse
for the request
recaptcha_challenge_field -- The value of recaptcha_challenge_field from the form
recaptcha_response_field -- The value of recaptcha_response_field from the form
private_key -- your reCAPTCHA private key
remoteip -- the user's ip address
"""
if not (recaptcha_response_field and recaptcha_challenge_field and
len (recaptcha_response_field) and len (recaptcha_challenge_field)):
return RecaptchaResponse(is_valid = False, error_code = 'incorrect-captcha-sol')
params = {
'privatekey': private_key,
'remoteip' : remoteip,
'challenge': recaptcha_challenge_field,
'response' : recaptcha_response_field,
}
result = urlfetch.fetch(
url = "http://%s/verify" % VERIFY_SERVER,
payload = urlencode(params),
method = urlfetch.POST,
headers = {
"Content-type": "application/x-www-form-urlencoded",
"User-agent": "reCAPTCHA Python/AppEngine"
}
)
if result.status_code == 200:
return_values = result.content.splitlines()
return_code = return_values[0]
if (return_code == "true"):
return RecaptchaResponse(is_valid=True)
else:
return RecaptchaResponse(is_valid=False, error_code = return_values[1])
Grabbing remote IP from web.py via web.ctx['ip']) now allows a simple to query to the reCAPTCHA service to check if you are human.
For HTML sanitization, I used Beautiful Soup. My sanitization code is run when a comment is added (as sanitizing comments when viewing an article caused appengine CPU utilization warnings.) The code is a modification of a django snippet
First I only allow absolute URLs that begin with http[s]:// instead of removing javascript: from the urls (since there are other ways to build bad urls)
absolute_url_matcher = re.compile("^https?://")
def url(URI):
if absolute_url_matcher.match(URI):
return URI
...
tag.attrs = [(attr, val) for attr, val in tag.attrs
if attr in valid_attrs and url(val)]
As comments containing code snippets isn't uncommon, I tweaked how PRE tags are handled:
BeautifulSoup.QUOTE_TAGS['pre'] = None # don't parse inside of PRE tags
...
if tag.name == 'pre':
# convert < into <
tag.replaceWith('<pre>%s</pre>' % tag.contents[0].replace('<', '<'))
Finally I add a BR tag whenever I see two returns to create "paragraphs."
Unfortunately I need to make a few more tweaks as some of the old comments on my blog aren't formated nicely. I always prefer to store both the user's original input and the sanitized version, both so I can re-run the conversion and I can quickly see the offending html if a XSS hole is discovered.
So, why did I do this? I'm a fan of cloud computing, and have used every Amazon Web Service I could find a use/experiment for. While I prefer ruby to python, Google's cloud offering is very enticing, and only by using it can you really know the power/limitations.
