overstimulate

A plan for (comment) spam

Mon, 18 Apr 2005 comments

(none of this is novel, just an improvement to my custom Typo)

Paul Graham's A Plan for Spam fixed what was wrong with email for many of us (and whether Graham's solution was really a Bayes-based solution is academic, since it worked).

Blogs face many of the same problems, except that spam comments degrade our sites by taking attention away from our content, taking up our valuable time to prune, and ruining the chances of conversations occurring.

My current plan is to use Lucas Carlson's Classifier (available via RubyGems).

Training is done via the developing Admin section of Typo. In the Comments section, I have added a button to mark a message as spam; it uses an Ajax method to initiate training of the classifier and then removes the message. Similarly, you can click a button to train the classifier on a message that isn't spam.
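To make the "mark as spam" / "mark as not spam" training concrete, here is a minimal sketch of a naive Bayes text classifier in plain Ruby. It is a stand-in for illustration only, not the Classifier gem's code: the `TinyBayes` class, its method names, and the `:spam`/`:ham` labels are all hypothetical, and it assumes equal priors for both categories.

```ruby
# Hypothetical stand-in for a Bayesian comment filter (not the Classifier gem).
class TinyBayes
  def initialize(*categories)
    # Per-category word counts, e.g. @counts[:spam]["pills"] => 3
    @counts = {}
    categories.each { |c| @counts[c] = Hash.new(0) }
    @totals = Hash.new(0)
  end

  # Called by the "this is spam" / "this is not spam" buttons.
  def train(category, text)
    tokens(text).each do |word|
      @counts[category][word] += 1
      @totals[category] += 1
    end
  end

  # Log-score per category with add-one smoothing; higher means more likely.
  def scores(text)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    result = {}
    @counts.each do |category, words|
      result[category] = tokens(text).inject(0.0) do |sum, word|
        sum + Math.log((words[word] + 1.0) / (@totals[category] + vocab + 1))
      end
    end
    result
  end

  def classify(text)
    scores(text).max_by { |_, score| score }.first
  end

  private

  def tokens(text)
    text.downcase.scan(/[a-z0-9']+/)
  end
end

filter = TinyBayes.new(:spam, :ham)
filter.train(:spam, "cheap pills buy now online casino")
filter.train(:ham, "nice post, I use Typo for my blog too")
puts filter.classify("buy cheap pills")
```

With only two training messages the guesses are obviously fragile, which is why the manual confirm/reject step matters early on.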

Currently I am only training on the text of the message, though perhaps more information should be used as input to the classifier. Also, the current implementation of Classifier strips HTML, even though the markup may be useful in identifying spammers.
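One way to keep some of that extra information is to turn it into pseudo-words before the markup is thrown away. The sketch below is only an assumption about how that could look: `comment_tokens`, the `__link__` marker, and the `author:` prefix are hypothetical names, and the regular expressions are a rough approximation rather than a real HTML parser.

```ruby
# Hypothetical tokenizer: keeps a trace of links and the author name
# alongside the plain words of the comment body.
def comment_tokens(body_html, author: nil)
  # Count anchor tags before stripping the markup.
  links = body_html.scan(/<a\s[^>]*href=/i).size

  # Crude tag stripping, then split into lowercase words.
  text  = body_html.gsub(/<[^>]*>/, " ")
  words = text.downcase.scan(/[a-z0-9']+/)

  tokens = words + ["__link__"] * links
  tokens << "author:#{author.downcase}" if author
  tokens
end

p comment_tokens('Buy <a href="http://x.example">pills</a> now', author: "Bob")
```

A comment with many links then contributes many `__link__` tokens, so the classifier can learn that link-heavy comments tend to be spam.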

I am using Madeleine snapshots on the filesystem, but it may be preferable to save the classifier in the database. Perhaps sharing the trained classifier, or the comments themselves, will be useful in developing a shared spam filter as well.
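For comparison, here is a rough sketch of snapshotting trained state with Ruby's built-in `Marshal`, without Madeleine. The `save_snapshot`/`load_snapshot` helpers, the file name, and the shape of the `state` hash are all hypothetical. The string returned by `Marshal.dump` could just as well be written to a binary column in the database instead of a file.

```ruby
require "tmpdir"

# Hypothetical helpers: write and read a Marshal snapshot of the trained state.
def save_snapshot(state, path)
  File.open(path, "wb") { |f| f.write(Marshal.dump(state)) }
end

def load_snapshot(path)
  # Only load snapshots you wrote yourself; Marshal is unsafe on untrusted data.
  File.open(path, "rb") { |f| Marshal.load(f.read) }
end

# Assumed shape of the trained state: word counts per category.
state = { spam: { "pills" => 3 }, ham: { "typo" => 2 } }
path  = File.join(Dir.tmpdir, "spam_filter.snapshot")

save_snapshot(state, path)
restored = load_snapshot(path)
puts restored == state
```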

The system shows its guess as to whether each comment is spam, but while the classifier is undertrained it will not be very accurate, so currently I manually confirm or reject its guesses. I have experimented with classifying comments when they are added and not showing them if they are strongly classified as spam. If the classification of a comment is not strong (that is, the magnitude of the spam classification is not significantly larger than that of the non-spam classification), I display it. Either way you need to go into the Comments section to confirm the classifications. I created separate views for spam and non-spam in the Admin section, as well as an admin RSS/XML feed that contains the comments along with links that allow easy removal of spam or confirmation of non-spam.
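The "only hide it when the spam classification is strong" rule can be sketched as a comparison of the two scores against a margin. The function name, the symbols it returns, and the default margin of 5.0 are hypothetical; the sketch assumes log-style scores where higher means more likely, and a sensible margin would have to be tuned on real comments.

```ruby
# Hypothetical decision rule: hide a new comment only when the spam score
# beats the non-spam score by a clear margin. Both outcomes still wait for
# the admin to confirm or reject the guess.
def moderation_action(spam_score, ham_score, margin = 5.0)
  if spam_score - ham_score >= margin
    :hide_pending_review
  else
    :show_pending_review
  end
end

puts moderation_action(-10.0, -20.0) # strongly spam, so hidden
puts moderation_action(-10.0, -12.0) # weak evidence, so shown
```

Erring towards showing a comment keeps a badly trained filter from silently swallowing real conversation.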

Should we add this to Typo or is this overkill? (At least it was fun implementing!)


Responses to "A plan for (comment) spam"

  1. Mon, 18 Apr 2005 Michael Schubert says:
    Definitely *not* overkill. I had a WP blog up for 3 months before being discovered by spammers... I then had 1-3 spam comments per entry dating back the entire 3 months. Sure, WP and others have some form of spam protection (there is an article about 6-7 different ways and their effectiveness out there), but using a Bayesian classifier is the *SMART* way to do spam filtering. You can train your blog to automatically mark certain words as spam... because if I have a blog and I write about viagra or refinancing my house.. I want to train my blog to not consider those words spam just because everyone else on the internet doesn't blog about them. This is a good way to keep things simple... the only possible feature I can think of would be an ActionMailer method to fire off suspect comments to the admin of the blog with a unique expirable URL that simply includes an html-escaped version of the comment and a "click here if this comment is not spam" and "click here if it is, and to remove it", so I can moderate my blog without having to log into the admin interface every time I want to moderate quotes.
  2. Mon, 18 Apr 2005 Michael Schubert says:
    ... FYI You need to fix the AJAX posting of your comments... the FTY (fade to yellow) fades from yellow to white... white on white text... not a smart thing. :-) Probably need to muck around in prototype.js and tweak it (perhaps we should patch the effect to have a fade from to color so we get a FAT.. Fade to AnyThing)
  3. Mon, 18 Apr 2005 Jesse says:
    Michael, Great ideas, we should use the articles in training non-spam classification automatically! The mailing of suspects off is a good idea as well, and as you say will be easy with ActionMailer. Also, I hadn't noticed about the FTY messing up. I'll fix that soon!


Jesse Andrews
open source, web browsers, web services, web sites & folk dancing. contacts/sites
