(None of this is novel; it is just an improvement to my custom Typo.)
Paul Graham's A Plan for Spam fixed what was wrong with email for many of us (whether Graham's solution was really Bayesian is academic, since it worked).
Blogs face many of the same problems, except that here spam comments degrade our sites by drawing attention away from our content, taking up our valuable time to prune, and ruining the chances of conversations occurring.
My current plan is to use Lucas Carlson's Classifier (available via RubyGems).
Training is done via the Admin section I am developing for Typo. In the Comments section, I have added a button to mark a message as spam; it uses an Ajax call to train the classifier on the message and then removes it. Similarly, you can click a button to train the classifier on a message that isn't spam.
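To make the train-on-click idea concrete, here is a rough sketch of what each button press amounts to. TinyBayes is a hypothetical stand-in written for illustration (a bare-bones naive Bayes with equal priors); it is not the Classifier gem's code or API, and the sample comments are made up.

```ruby
# Minimal naive Bayes sketch (a hypothetical stand-in, NOT the Classifier gem).
class TinyBayes
  def initialize(*categories)
    # Word counts per category, e.g. @counts[:spam]["pills"] => 1
    @counts = categories.to_h { |c| [c, Hash.new(0)] }
    @totals = Hash.new(0)
  end

  # What a click on "spam" or "not spam" would trigger for one comment.
  def train(category, text)
    tokenize(text).each do |word|
      @counts.fetch(category)[word] += 1
      @totals[category] += 1
    end
  end

  # Returns a log-score per category; higher means more likely.
  # Category priors are ignored (assumed equal) to keep the sketch short.
  def classifications(text)
    vocab = @counts.values.flat_map(&:keys).uniq.size
    @counts.each_with_object({}) do |(category, words), scores|
      scores[category] = tokenize(text).sum do |word|
        # Laplace smoothing so unseen words do not rule out a category.
        Math.log((words[word] + 1.0) / (@totals[category] + vocab + 1))
      end
    end
  end

  def classify(text)
    classifications(text).max_by { |_, score| score }.first
  end

  private

  def tokenize(text)
    text.downcase.scan(/[a-z0-9']+/)
  end
end

filter = TinyBayes.new(:spam, :ham)
filter.train(:spam, "cheap pills buy now online casino")
filter.train(:spam, "buy cheap watches online now")
filter.train(:ham, "great post, I use Typo for my own blog")
filter.train(:ham, "thanks, the Rails tips in this post helped")
```

With only four training comments the guesses are fragile, which is exactly why the admin buttons matter: every click adds another labelled example.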
Currently I am only training on the text of the message, though perhaps more information should be used as input to the classifier. Also, the current implementation of Classifier strips HTML, which may itself be useful in identifying spammers.
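One way to feed in more than the message text is to fold the extra signals into the string before the HTML is stripped. The sketch below is only an illustration: comment_features, its field names (author, url, body), and the synthetic tokens such as linkcount1 are all assumptions of mine, and whether such tokens survive depends on how the classifier tokenizes its input.

```ruby
# Hypothetical helper: builds the text handed to a classifier from more than
# the comment body. Field names and token formats are assumptions.
def comment_features(author:, url:, body:)
  # Synthetic tokens are kept purely alphanumeric so that a tokenizer which
  # drops punctuation still sees them as single words.
  tag = ->(prefix, value) { prefix + value.downcase.gsub(/[^a-z0-9]/, "") }

  link_count = body.scan(/<a\s/i).size
  hosts = body.scan(/href=["']?https?:\/\/([^\/"'\s>]+)/i).flatten
  # Crude tag stripping, good enough for a sketch only.
  plain = body.gsub(/<[^>]+>/, " ").squeeze(" ").strip

  tokens = []
  tokens << tag.call("author", author) unless author.to_s.empty?
  tokens << tag.call("url", url) unless url.to_s.empty?
  tokens << "linkcount#{[link_count, 5].min}" # 5 or more links share one token
  tokens.concat(hosts.map { |host| tag.call("linkhost", host) })
  (tokens + [plain]).join(" ")
end

features = comment_features(
  author: "Bob",
  url: "",
  body: 'Nice <a href="http://pills.example.com/x">deal</a> here'
)
```

The link count and link hosts are taken from the raw HTML first, so that information is not lost when the tags are removed.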
I am using Madeleine snapshots on the filesystem, but it may be preferable to save the classifier in the database. Perhaps sharing the trained spam classifier, or the comments themselves, would be useful in developing a shared classifier as well.
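For comparison, here is a rough sketch of the two storage options using Ruby's built-in Marshal rather than Madeleine. The state hash is a made-up stand-in for trained classifier data, and the temporary file simply plays the role of a snapshot on disk.

```ruby
require "tempfile"

# Stand-in for trained classifier state: word counts per category.
state = { spam: { "pills" => 3, "casino" => 2 }, ham: { "typo" => 4 } }

# Marshal.dump yields a binary String. The same bytes could be written to a
# file or stored in a binary (BLOB) column of a database table.
blob = Marshal.dump(state)

# Filesystem variant: write the bytes out, then read them back.
restored_from_file = Tempfile.create("classifier") do |file|
  file.binmode
  file.write(blob)
  file.rewind
  Marshal.load(file.read)
end
```

One caution if trained data is ever shared between blogs: Marshal.load should not be run on data from an untrusted source, so a plain data format such as word counts in JSON or YAML would be a safer exchange format.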
The system shows its guess as to whether each comment is spam, but while the classifier is undertrained it will not be very accurate, so for now I manually confirm or reject its guesses. I have experimented with classifying comments when they are added and hiding those that are strongly classified as spam. If a comment is not strongly classified as spam (that is, its spam score is not significantly larger than its non-spam score), I display it. Either way, you need to go into the Comments section to confirm the classifications. I created separate views for spam and non-spam in the Admin section, as well as an admin RSS/XML feed that contains the comments along with links that allow easy removal of spam or confirmation of non-spam.
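The hide-only-when-strongly-spam rule can be sketched as a small decision function. This assumes the classifier reports one score per category where higher means more likely (log-scores, for instance). The comment_action name and the margin value of 5.0 are hypothetical and would need tuning against real comments.

```ruby
# Decide what to do with a new comment from per-category scores.
# `margin` is a hypothetical tuning knob, not a value from any library.
def comment_action(scores, margin: 5.0)
  spam = scores.fetch(:spam)
  ham  = scores.fetch(:ham)
  if spam - ham > margin
    :hide     # strongly spam: keep it off the public page until reviewed
  else
    :display  # weak or non-spam guess: show it, but it still needs confirming
  end
end

strong_spam = comment_action({ spam: -10.0, ham: -30.0 })
borderline  = comment_action({ spam: -10.0, ham: -12.0 })
clear_ham   = comment_action({ spam: -30.0, ham: -10.0 })
```

Erring toward display means an undertrained classifier costs a few visible spam comments rather than hiding a real reader's comment.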
Should we add this to Typo or is this overkill? (At least it was fun implementing!)