Validating HTML in Ruby with libxml

While moving this site back to Rails (edge) from a Rake-based static site, I added HTML validation to the articles model.

The first step is to install libxml-ruby. This can be done via RubyGems: gem install libxml-ruby

To check whether a string is valid HTML (strictly speaking, well-formed XML), you need to wrap it in a div first. Otherwise, a fragment with more than one top-level element fails with:
parser error : Extra content at the end of the document

parser = XML::Parser.new
parser.string = "<div>#{html}</div>"
parser.parse
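
The wrapper matters because an XML document may have only one root element. As a rough illustration that does not need libxml installed, the sketch below uses REXML (a different parser, bundled with Ruby) and a hypothetical helper named well_formed_fragment?. It is only meant to show the wrapping idea, not the code this site uses.

```ruby
require "rexml/document"

# Hypothetical helper: wraps the fragment in a single root <div> and
# reports whether the result parses as well-formed XML.
def well_formed_fragment?(html)
  REXML::Document.new("<div>#{html}</div>")
  true
rescue REXML::ParseException
  false
end

well_formed_fragment?("<p>one</p><p>two</p>")  # two siblings, accepted once wrapped
well_formed_fragment?("<p>never closed")       # mismatched end tag, rejected
```

REXML and libxml report problems differently, so treat this as a way to experiment with the wrapper rather than a drop-in replacement.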

If you run the previous code in an IRB session, parser.parse returns an XML::Document even if the document has problems. In that case the errors are written to stderr, with a caret pointing at each one. In a web app, sending the errors to stderr is probably not what you want. To show them to the user, capture them with a custom error handler.

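parser = XML::Parser.new
parser.string = "<div>#{self.body}</div>"
msgs = []
# Collect libxml's messages instead of letting them go to stderr.
XML::Parser.register_error_handler lambda { |msg| msgs << msg }
begin
  parser.parse
rescue StandardError
  # Escape '<' so markup quoted in the messages is displayed, not rendered.
  errors.add("body", '<pre>' + msgs.collect { |m| m.gsub('<', '&lt;') }.join + '</pre>')
end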
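
The snippet above escapes only the < character. If the messages might also contain & or quotes, a slightly more thorough version could use ERB::Util.html_escape from Ruby's standard library. This is a minimal sketch; format_parser_errors is a hypothetical helper name, and the sample message is made up for illustration.

```ruby
require "erb"

# Hypothetical helper: joins the collected parser messages into one <pre>
# block, escaping &, <, > and quotes so quoted markup is shown literally.
def format_parser_errors(msgs)
  "<pre>" + msgs.map { |msg| ERB::Util.html_escape(msg) }.join + "</pre>"
end

# Made-up sample message containing markup and an ampersand.
puts format_parser_errors(["bad markup near <p> & </div>\n"])
```

The result could then be passed to errors.add("body", ...) in place of the inline gsub.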

I wrapped the error messages in a <pre> so that they can be presented to the user with the standard helper method error_messages_for. With a little CSS to make the errors fixed-width, I get useful error reporting on invalid HTML.

.errorExplanation pre {
    font-family: monospace;
}

Published

Wed, 29 Aug 2007
