powered by web.py + appengine

I've managed to rewrite my blog again. This time to appengine using web.py.

I started with the demo app Aaron put together the night appengine was released, and with some pointers from Kragen I was quickly 80% done with the new site. The next 15% involved figuring out how to get reCaptcha and HTML sanitization/cleanup working. Once that was done, it took only a few DNS changes to bring the new site live.

To get reCaptcha integrated, I started with recaptcha-client 1.0.1, and modified it to use appengine's urlfetch:

from urllib import urlencode  # Python 2 standard library

from google.appengine.api import urlfetch

def submit(recaptcha_challenge_field,
           recaptcha_response_field,
           private_key,
           remoteip):
    """
    Submits a reCAPTCHA request for verification. Returns RecaptchaResponse
    for the request

    recaptcha_challenge_field -- The value of recaptcha_challenge_field from the form
    recaptcha_response_field -- The value of recaptcha_response_field from the form
    private_key -- your reCAPTCHA private key
    remoteip -- the user's ip address
    """

    if not (recaptcha_response_field and recaptcha_challenge_field and
            len (recaptcha_response_field) and len (recaptcha_challenge_field)):
        return RecaptchaResponse(is_valid = False, error_code = 'incorrect-captcha-sol')

    params = {
        'privatekey': private_key,
        'remoteip' : remoteip,
        'challenge': recaptcha_challenge_field,
        'response' : recaptcha_response_field,
    }

    result = urlfetch.fetch(
        url = "http://%s/verify" % VERIFY_SERVER,
        payload = urlencode(params),
        method = urlfetch.POST,
        headers = {
            "Content-type": "application/x-www-form-urlencoded",
            "User-agent": "reCAPTCHA Python/AppEngine"
        }
    )
    
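    if result.status_code == 200:
        return_values = result.content.splitlines()
        return_code = return_values[0]

        if return_code == "true":
            return RecaptchaResponse(is_valid=True)
        else:
            return RecaptchaResponse(is_valid=False, error_code = return_values[1])

    # Any other HTTP status: report a failure instead of implicitly returning None
    return RecaptchaResponse(is_valid=False, error_code = 'recaptcha-not-reachable')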

Grabbing the remote IP from web.py (via web.ctx['ip']) now allows a simple query to the reCAPTCHA service to check whether you are human.
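The body that submit parses is plain text: the first line is true or false, and on failure the second line carries an error code. Here is a minimal sketch of that parsing step on its own, which can be exercised without touching the network. The name parse_verify_response is hypothetical (it is not part of recaptcha-client), and the 'unknown' fallback for a truncated body is an assumption.

```python
def parse_verify_response(body):
    """Return (is_valid, error_code) for a reCAPTCHA verify response body."""
    lines = body.splitlines()
    if lines and lines[0] == "true":
        return (True, None)
    # On failure the second line holds the error code; fall back to a
    # placeholder (an assumption) if the body is empty or truncated.
    error_code = lines[1] if len(lines) > 1 else "unknown"
    return (False, error_code)
```

Keeping the parsing separate from the urlfetch call makes it easy to check the failure paths by hand.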

For HTML sanitization, I used Beautiful Soup. My sanitization code runs when a comment is added, since sanitizing comments every time an article was viewed caused appengine CPU utilization warnings. The code is a modification of a django snippet.

First, I only allow absolute URLs that begin with http[s]://, rather than stripping javascript: from the URLs, since there are other ways to build bad URLs:

absolute_url_matcher = re.compile("^https?://")

def url(URI):
    if absolute_url_matcher.match(URI):
        return URI

...

        tag.attrs = [(attr, val) for attr, val in tag.attrs
                     if attr in valid_attrs and url(val)]
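To see why a whitelist is safer than stripping javascript:, here is a small sketch that runs the same pattern over a handful of URLs. The candidate list is made up for illustration. Note that the pattern is case-sensitive, so an uppercase HTTP:// would also be rejected.

```python
import re

# Same pattern as above: only absolute http:// or https:// URLs pass.
absolute_url_matcher = re.compile("^https?://")

def url(URI):
    if absolute_url_matcher.match(URI):
        return URI

# Illustrative inputs (assumptions, not taken from real comments).
candidates = [
    "http://example.com/",
    "https://example.com/a?b=c",
    "javascript:alert(1)",
    "JaVaScRiPt:alert(1)",                       # mixed case
    " javascript:alert(1)",                      # leading whitespace
    "java\tscript:alert(1)",                     # embedded tab
    "data:text/html,<script>alert(1)</script>",  # a different scheme
    "/relative/path",
]

# Only the two absolute http(s) URLs survive the filter.
accepted = [c for c in candidates if url(c)]
```

A blacklist would need a rule for each of the rejected variants, while the whitelist rejects them all with one pattern.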

As comments containing code snippets aren't uncommon, I tweaked how PRE tags are handled:

BeautifulSoup.QUOTE_TAGS['pre'] = None  # don't parse inside of PRE tags

...
        if tag.name == 'pre':
            # convert < into &lt;
            tag.replaceWith('<pre>%s</pre>' % tag.contents[0].replace('<', '&lt;'))
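The PRE handling above escapes only <. A slightly more defensive variant, sketched here as a standalone function with no Beautiful Soup dependency, also escapes & and > so that entity-like text in pasted code is displayed literally. The name escape_pre is hypothetical, and escaping the extra characters is an assumption, not something the code above does.

```python
def escape_pre(text):
    """Escape text for display inside a <pre> element."""
    # Escape & first so the entities produced below are not double-escaped.
    text = text.replace('&', '&amp;')
    text = text.replace('<', '&lt;')
    text = text.replace('>', '&gt;')
    return '<pre>%s</pre>' % text
```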

Finally I add a BR tag whenever I see two returns to create "paragraphs."
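A minimal sketch of that last step, assuming a "return" means either \n or \r\n and that a run of several blank lines should collapse into a single BR. The name add_breaks is hypothetical, and this sketch does not try to skip text inside PRE tags.

```python
import re

# Two or more consecutive line endings (\n or \r\n) count as one paragraph break.
paragraph_break = re.compile(r'(?:\r?\n){2,}')

def add_breaks(text):
    """Insert a BR tag wherever the text contains a blank line."""
    return paragraph_break.sub('<br />\n', text)
```

A single newline is left alone, so ordinary line wrapping inside a comment does not produce extra breaks.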

Unfortunately, I need to make a few more tweaks, as some of the old comments on my blog aren't formatted nicely. I always prefer to store both the user's original input and the sanitized version, so that I can re-run the conversion and quickly see the offending HTML if an XSS hole is discovered.
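The idea of keeping both versions can be sketched with a plain Python class. On appengine the two fields would presumably be two datastore properties, but that part is left out here. Comment, resanitize, and both sanitizer functions are hypothetical names for illustration only.

```python
class Comment(object):
    """Keeps the untouched input alongside the sanitized HTML derived from it."""

    def __init__(self, raw, sanitizer):
        self.raw = raw               # exactly what the user submitted
        self.html = sanitizer(raw)   # what gets rendered on the page

    def resanitize(self, sanitizer):
        """Rebuild the rendered HTML from the stored original."""
        self.html = sanitizer(self.raw)
        return self.html


# Toy sanitizers: the first lets everything through, the second is stricter.
def permissive(text):
    return text

def strict(text):
    return text.replace('<', '&lt;')

comment = Comment('<b>hi</b>', permissive)
comment.resanitize(strict)   # re-run the conversion after tightening the rules
```

Because raw is never modified, a later fix to the sanitizer can be applied to every stored comment without losing what the user originally typed.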

So, why did I do this? I'm a fan of cloud computing, and have used every Amazon Web Service I could find a use/experiment for. While I prefer ruby to python, Google's cloud offering is very enticing, and only by using it can you really know the power/limitations.


Published

Mon, 28 Apr 2008
