For a few months back in 2006, we misconfigured something in our Rails/DB stack, and some improperly encoded data got in. As a result, working on userscripts.org on my laptop with a copy of the real database hasn't been possible. I can take a SQL dump, but when I reload it I get:
% psql uso_dev -f dump.sql
psql:dump.sql:33795: ERROR: invalid byte sequence for encoding "UTF8": 0xfa
HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY scripts, line 54
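To make the error concrete, here is a minimal sketch (modern Ruby, not the Ruby this site ran on in 2006) of why the byte 0xfa is rejected. The sample string and the guess that the byte came from Latin-1 input are assumptions for illustration only; the real source encoding of the bad rows isn't known.

```ruby
# Assumption: the stray byte came from Latin-1 (ISO-8859-1) input.
# "caf\xFA" is a made-up sample; 0xFA is the byte named in the psql error.
raw = "caf\xFA".b # .b gives a binary (ASCII-8BIT) copy of the bytes

# Read as UTF-8, 0xFA is a lead byte with no valid continuation bytes.
as_utf8 = raw.dup.force_encoding("UTF-8")
utf8_ok = as_utf8.valid_encoding?

# Read as Latin-1, the same byte is a single character, which can then
# be transcoded to well-formed UTF-8.
repaired = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts "valid as UTF-8? #{utf8_ok}"
puts "read as Latin-1 and transcoded: #{repaired}"
```

If the bad rows really were Latin-1, this kind of transcoding would repair them, but that has to be checked per record rather than assumed.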
To help fix this, I wrote a small rake task that builds a report of records with encoding issues.
def invalid_encodings( model, fields, deleted=false )
  report = {}
  # acts_as_paranoid hides soft-deleted rows unless asked for them explicitly
  records = if (deleted)
    model.find_with_deleted(:all)
  else
    model.find(:all)
  end
  records.each do |record|
    fields.each do |field|
      begin
        # unpack('U') raises on a malformed UTF-8 sequence
        record[field].each_char { |char| char.unpack('U') } unless record[field].blank?
      rescue
        # remember which field of which record failed
        report[ record.id ] ||= []
        report[ record.id ] << field
      end
    end
  end
  report
end
If you use acts_as_paranoid, pass true as the third parameter, since deleted records are still in the dump and will cause problems if they are not encoded properly.
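For anyone trying the same idea on Ruby 1.9 or later, here is a rough standalone sketch of the report. It is not the task above: it swaps the unpack('U') probe for String#valid_encoding?, and it uses plain hashes in place of ActiveRecord models so it runs without Rails. The name invalid_fields and the sample records are made up for illustration.

```ruby
# Hypothetical stand-in for the rake task: records are plain hashes
# with an :id key instead of ActiveRecord models.
def invalid_fields(records, fields)
  report = Hash.new { |hash, id| hash[id] = [] }
  records.each do |record|
    fields.each do |field|
      value = record[field]
      next if value.nil? || value.empty?
      # Reinterpret the raw bytes as UTF-8 and ask Ruby whether they are well formed.
      utf8 = value.dup.force_encoding("UTF-8")
      report[record[:id]] << field unless utf8.valid_encoding?
    end
  end
  report
end

# Made-up sample data: only record 2 carries a bad byte, in :src.
sample = [
  { id: 1, name: "plain ascii", src: "alert('ok');" },
  { id: 2, name: "caf\u00E9",   src: "var s = 'caf\xFA';".b },
  { id: 3, name: nil,           src: "" }
]

report = invalid_fields(sample, [:name, :src])
p report
```

The result has the same shape as the report from the rake task: record ids mapped to the list of fields that failed the check.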
I'm almost able to use the database on my laptop. So far I've fixed:
invalid_encodings Tag, [:name]
invalid_encodings User, [:display_name, :email, :bio, :login, :website], true
invalid_encodings Comment, [:body]
Unfortunately, many scripts have encoding issues in their source:
invalid_encodings Script, [:summary, :description_extended, :name, :homepage, :location, :src]
=> {7628=>[:src], 5120=>[:src], 5367=>[:src], 5405=>[:src], 4645=>[:src], 7932=>[:src],
5329=>[:src], 3667=>[:src], 4627=>[:src], 3582=>[:src], 4323=>[:src], 4827=>[:src],
3535=>[:src], 5416=>[:src], 7658=>[:src], 4048=>[:src], 4077=>[:src], 3792=>[:src],
7801=>[:src], 7592=>[:src], 4324=>[:src], 7364=>[:src], 4030=>[:src], 7659=>[:src],
3840=>[:src], 5417=>[:src], 5389=>[:src], 4287=>[:src], 4905=>[:src], 3138=>[:src],
4677=>[:src], 4259=>[:src], 3841=>[:src], 5428=>[:src], 5352=>[:src], 3471=>[:src],
4592=>[:src], 7699=>[:src], 3120=>[:src], 3215=>[:src], 7927=>[:src], 4374=>[:src],
4279=>[:src], 4973=>[:src], 2199=>[:src], 5163=>[:src], 4023=>[:src], 4289=>[:src],
3662=>[:src], 7880=>[:src], 3434=>[:src], 4726=>[:src], 4384=>[:src], 4641=>[:src],
5363=>[:src], 4394=>[:src], 4385=>[:src], 4214=>[:src], 3777=>[:src], 3435=>[:src],
4632=>[:src], 3483=>[:src], 4718=>[:src], 7929=>[:src], 7758=>[:src], 7796=>[:src],
2552=>[:src], 4386=>[:src], 3873=>[:src], 3484=>[:src], 5251=>[:src], 4320=>[:src],
7683=>[:src], 7883=>[:src], 4121=>[:src], 4425=>[:src], 3361=>[:src], 2563=>[:src],
7579=>[:src], 5328=>[:src], 3656=>[:src], 5404=>[:src], 7361=>[:src], 5271=>[:src],
3428=>[:src], 4084=>[:src], 4388=>[:src]}
Eventually I'll be able to get back to real work, like fixing user-facing issues on userscripts.org such as tags, feeds, ...