overstimulate

Validating HTML in Ruby with libxml

Wed, 29 Aug 2007 libxml rails ruby

While moving this site back to Rails (edge) from a rake-based static site, I added HTML validation to the articles model.

The first step is to install libxml-ruby. This can be done via RubyGems: gem install libxml-ruby

To check a string of HTML, you need to wrap it in a single root element such as a div. Otherwise a fragment with more than one top-level element fails with:
parser error : Extra content at the end of the document

require 'xml/libxml'

# Wrap the fragment so the parser sees exactly one root element.
parser = XML::Parser.new
parser.string = "<div>#{html}</div>"
parser.parse

If you run the previous code in an IRB session, parser.parse returns an XML::Document even if the document has problems. When it does, the errors are written to stderr, with a caret pointing at each one. In a web app, sending the errors to stderr is probably not what you want. To show them to the user, capture them with a custom error handler.

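parser = XML::Parser.new
parser.string = "<div>#{self.body}</div>"
msgs = []
# Collect libxml's messages in an array instead of letting them go to stderr.
XML::Parser.register_error_handler lambda { |msg| msgs << msg }
begin
  parser.parse
rescue Exception => e
  # Escape '<' so the messages show up as text inside the <pre> block.
  errors.add("body", '<pre>' + msgs.collect{|c| c.gsub('<', '&lt;') }.join + '</pre>')
end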

I wrapped the error messages in a <pre> so that they can be presented to the user with the standard error_messages_for helper. With a little CSS to make the errors fixed-width, I get useful error reporting on invalid HTML.

.errorExplanation pre {
    font-family: monospace;
}
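
The escaping step in the error handler above only replaces <, so a message containing & or > would still reach the page unescaped. Here is a minimal sketch of a slightly more defensive version, assuming a hypothetical helper named format_parser_errors (it is not part of libxml-ruby or Rails) and using ERB::Util.html_escape from Ruby's standard library. The sample messages are made up for illustration and are not real libxml output.

```ruby
require 'erb'

# Hypothetical helper: turns the collected parser messages into a single
# HTML-safe <pre> block. Escapes &, <, >, and quotes, not just '<'.
def format_parser_errors(msgs)
  escaped = msgs.map { |msg| ERB::Util.html_escape(msg) }
  "<pre>#{escaped.join}</pre>"
end

# Made-up messages standing in for whatever the error handler collected.
sample_msgs = ["sample error near <p> & <div>\n", "  ^\n"]
puts format_parser_errors(sample_msgs)
```

Inside the model, the result could be handed to errors.add("body", ...) in place of the inline gsub. Because each message keeps its trailing newline, the caret line still lines up under the offending markup once the <pre> is rendered in a fixed-width font.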

Responses to "Validating HTML in Ruby with libxml"

  1. Fri, 31 Aug 2007 Christoffer Sawicki says:
    Your code is actually not validating anything; it's just checking XML well-formedness. I'm using REXML for the same thing currently and am interested in how good the error messages from libxml are. How are they? :)
  2. Fri, 31 Aug 2007 Jesse Andrews says:
    Christoffer, Thanks for the correction. Some of the errors are really useful (and point to exactly where the problem is), although errors such as unmatched tags tend to be obfuscated.


Jesse Andrews
open source, web browsers, web services, web sites & folk dancing. contacts/sites
