overstimulate

Report of records with invalid encoding

For a few months back in 2006, we misconfigured something in our Rails/DB stack and some improperly encoded data got in. As a result, I haven't been able to work on userscripts.org on my laptop with a copy of the real database. I can take a SQL dump, but when I reload it I get:

% psql uso_dev -f dump.sql
psql:dump.sql:33795: ERROR:  invalid byte sequence for encoding "UTF8": 0xfa
HINT:  This error can also happen if the byte sequence does not match the encoding 
expected by the server, which is controlled by "client_encoding".
CONTEXT:  COPY scripts, line 54
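
The byte 0xfa in that error can never appear in valid UTF-8, so a single stray byte is enough to abort the whole COPY. Here is a minimal sketch for confirming that and locating the first bad byte in a string. It assumes Ruby 2.0 or later (newer than the stack this post was written on), and the helper name first_invalid_byte_offset is made up for illustration.

```ruby
# Return the byte offset of the first byte that breaks UTF-8 decoding,
# or nil when the whole string is valid UTF-8.
def first_invalid_byte_offset(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return nil if utf8.valid_encoding?

  offset = 0
  utf8.each_char do |char|
    # each_char yields undecodable bytes as their own (invalid) chars
    return offset unless char.valid_encoding?
    offset += char.bytesize
  end
  nil
end

good = "caf\u00e9"   # valid UTF-8
bad  = "men\xFA".b   # ends with a raw 0xfa byte

puts first_invalid_byte_offset(good).inspect
puts first_invalid_byte_offset(bad).inspect
```

Run against a column value pulled from the dump, the offset tells you where to look with a hex viewer.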

To help fix this, I wrote a small rake task that builds a report of records with encoding issues:

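# Returns { record_id => [fields containing invalid UTF-8] }.
# Pass deleted=true to include acts_as_paranoid soft-deleted rows.
def invalid_encodings( model, fields, deleted=false )
  report = {}
  records = if deleted
    model.find_with_deleted(:all)
  else
    model.find(:all)
  end
  records.each do |record|
    fields.each do |field|
      begin
        # unpack('U') raises on a malformed UTF-8 character
        record[field].each_char { |char| char.unpack('U') } unless record[field].blank?
      rescue
        # remember which field of which record failed to decode
        report[ record.id ] ||= []
        report[ record.id ] << field
      end
    end
  end
  report
end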
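
For anyone who wants to try the idea without a Rails app, here is a rough standalone sketch of the same report. It is not the task above: it assumes Ruby 1.9 or later and uses String#valid_encoding? instead of unpack, and the Row struct and the invalid_utf8_report name are hypothetical stand-ins for a model and the method.

```ruby
# Hypothetical stand-in for a model row; anything that responds to
# #id and #[] (Struct does) will work here.
Row = Struct.new(:id, :name, :src)

# Build { id => [fields whose bytes are not valid UTF-8] }.
def invalid_utf8_report(records, fields)
  report = Hash.new { |hash, id| hash[id] = [] }
  records.each do |record|
    fields.each do |field|
      value = record[field]
      next if value.nil? || value.empty?

      # Re-tag a copy as UTF-8 so the check does not depend on how
      # the string happened to be labelled when it was loaded.
      utf8 = value.dup.force_encoding(Encoding::UTF_8)
      report[record.id] << field unless utf8.valid_encoding?
    end
  end
  report
end

rows = [
  Row.new(1, "clean",  "alert('ok')"),
  Row.new(2, "broken", "alert('\xFA')"),
]
p invalid_utf8_report(rows, [:name, :src])
```

Only ids with at least one bad field show up in the result, which keeps the report short when most rows are fine.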

If you use acts_as_paranoid, pass true as the third argument: deleted records are still in the dump and will cause the same problems if they are not encoded properly.

I'm almost able to use the database on my laptop. So far I've fixed:

invalid_encodings Tag, [:name]
invalid_encodings User, [:display_name, :email, :bio, :login, :website], true
invalid_encodings Comment, [:body]

Unfortunately, many scripts have encoding issues in their source:

invalid_encodings Script, [:summary, :description_extended, :name, :homepage, :location, :src]
=> {7628=>[:src], 5120=>[:src], 5367=>[:src], 5405=>[:src], 4645=>[:src], 7932=>[:src], 
5329=>[:src], 3667=>[:src], 4627=>[:src], 3582=>[:src], 4323=>[:src], 4827=>[:src], 
3535=>[:src], 5416=>[:src], 7658=>[:src], 4048=>[:src], 4077=>[:src], 3792=>[:src], 
7801=>[:src], 7592=>[:src], 4324=>[:src], 7364=>[:src], 4030=>[:src], 7659=>[:src], 
3840=>[:src], 5417=>[:src], 5389=>[:src], 4287=>[:src], 4905=>[:src], 3138=>[:src], 
4677=>[:src], 4259=>[:src], 3841=>[:src], 5428=>[:src], 5352=>[:src], 3471=>[:src], 
4592=>[:src], 7699=>[:src], 3120=>[:src], 3215=>[:src], 7927=>[:src], 4374=>[:src], 
4279=>[:src], 4973=>[:src], 2199=>[:src], 5163=>[:src], 4023=>[:src], 4289=>[:src], 
3662=>[:src], 7880=>[:src], 3434=>[:src], 4726=>[:src], 4384=>[:src], 4641=>[:src], 
5363=>[:src], 4394=>[:src], 4385=>[:src], 4214=>[:src], 3777=>[:src], 3435=>[:src], 
4632=>[:src], 3483=>[:src], 4718=>[:src], 7929=>[:src], 7758=>[:src], 7796=>[:src], 
2552=>[:src], 4386=>[:src], 3873=>[:src], 3484=>[:src], 5251=>[:src], 4320=>[:src], 
7683=>[:src], 7883=>[:src], 4121=>[:src], 4425=>[:src], 3361=>[:src], 2563=>[:src], 
7579=>[:src], 5328=>[:src], 3656=>[:src], 5404=>[:src], 7361=>[:src], 5271=>[:src], 
3428=>[:src], 4084=>[:src], 4388=>[:src]}
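
One possible way to repair flagged values like these is sketched below. It assumes the bad bytes were originally ISO-8859-1, where 0xfa is "ú"; that is a guess, not something established above. The repair_to_utf8 name is hypothetical, and the code needs Ruby 2.0 or later.

```ruby
# Assumption: strings that are not valid UTF-8 were written as
# ISO-8859-1. Valid UTF-8 input is returned unchanged.
def repair_to_utf8(str, assumed = Encoding::ISO_8859_1)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Reinterpret the raw bytes in the assumed encoding, then transcode.
  str.dup.force_encoding(assumed).encode(Encoding::UTF_8)
end

fixed = repair_to_utf8("men\xFA".b)
puts fixed.valid_encoding?
```

This treats the whole string as the assumed encoding, so a value that mixes valid multibyte UTF-8 with stray bytes would come out double-encoded; records like that need a manual look.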

Eventually I'll be able to work on real issues, like fixing user-facing problems on userscripts such as tags, feeds, ...



Jesse Andrews
open source, web browsers, web services, web sites & folk dancing. contacts/sites
