Update: Make sure you read the comments on this post before following it. In particular, Pete raises the concern that applications whose data is already UTF-8, but marked as Latin1 in the database, may run into problems.
So you’ve got this Rails application you’ve been developing and all of a sudden you need to support Unicode. After all, not everybody speaks English. And some really awkward people like all sorts of typographic symbols in their medical articles. In fact, you wouldn’t believe all the weird characters these print-production-oriented people like to use…
Most of the instructions here were gleaned from a jabbering giraffe and the notes I wrote up from his talk. But I like to think I’ve had a bright idea of my own. :-) Note that these instructions assume you’re using Ruby 1.8.x, MySQL >= 5 and edge (soon to be 1.2) Rails.
OK, so to get Rails basically talking UTF-8, you have to do a couple of things.
Firstly, make Ruby itself a little bit Unicode-aware, by sticking the following in config/environment.rb:
$KCODE = 'u'
We also need to tell ActiveRecord that the connection it should open to MySQL should be UTF-8 encoded. This is done by putting the following in each of your database stanzas in config/database.yml:
encoding: utf8
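To show where that line sits, here is a small sketch that parses a sample stanza with Ruby's standard YAML library. The adapter, database name and username are placeholders I made up; only the encoding line comes from the instructions above.

```ruby
require 'yaml'

# Hypothetical stanza: adapter, database and username are placeholders.
EXAMPLE_DATABASE_YML = <<-YML
development:
  adapter: mysql
  database: myapp_development
  username: root
  encoding: utf8
YML

# Parse the stanza and read back the connection encoding.
config = YAML.load(EXAMPLE_DATABASE_YML)
puts config['development']['encoding']
```

The same encoding line would be repeated under the test and production stanzas.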
Finally, from a setup perspective, we need to migrate the current database to one which uses UTF-8 encoding internally. This is what I consider to be my ‘smart’ bit. :-) Create yourself a migration:
script/generate migration make_unicode_friendly
then paste in the following code:
This migrates the current database to using UTF-8 with general, case-insensitive collation, which affects the creation of future tables. It also updates each of the current tables, converting their contents to UTF-8 too.
And it’s reversible. Well, mostly. It assumes that the character set you were using previously was the server’s default (which will be the case unless you explicitly specified a character set or collation when creating the database), and reverts to that. Of course, a backward migration may well be lossy, since UTF-8 can hold characters the old character set cannot, so be careful trying that.
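For illustration, here is a minimal sketch of the kind of SQL a migration like this might execute. It is my own reconstruction under assumptions, not necessarily the code the post refers to: unicode_conversion_sql is a hypothetical helper name, and the commented usage assumes the MySQL adapter's current_database and tables methods.

```ruby
# Hypothetical helper: builds the statements that switch a database's
# default character set and convert each existing table.
def unicode_conversion_sql(database, tables, charset = 'utf8', collation = 'utf8_general_ci')
  # Changes the default for tables created from now on.
  statements = ["ALTER DATABASE `#{database}` DEFAULT CHARACTER SET #{charset} COLLATE #{collation}"]
  # Converts the columns and contents of each existing table.
  tables.each do |table|
    statements << "ALTER TABLE `#{table}` CONVERT TO CHARACTER SET #{charset} COLLATE #{collation}"
  end
  statements
end

# Inside a migration one might then write (assumption, needs a Rails app):
#
#   class MakeUnicodeFriendly < ActiveRecord::Migration
#     def self.up
#       connection = ActiveRecord::Base.connection
#       sql = unicode_conversion_sql(connection.current_database, connection.tables)
#       sql.each { |statement| execute statement }
#     end
#   end

puts unicode_conversion_sql('blog', ['posts', 'comments'])
```

A backward migration could call the same helper with the server's default character set and collation in place of utf8 and utf8_general_ci.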
The next bit is the tricky one. Most of the Ruby string functions aren’t Unicode-aware. They’ll quite happily slice up multi-byte characters.
Fortunately edge rails now extends String to provide a chars method which returns an ActiveSupport::Multibyte::Chars object. It walks like a string and talks like a string, but is multibyte aware. Nice.
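To make the byte-versus-character problem concrete, here is a small sketch in plain Ruby. It deliberately avoids the chars proxy so that it runs without Rails, and uses unpack('U*') and pack('U*') to work in code points instead. first_chars is a hypothetical helper of mine, not something Rails provides.

```ruby
# "café" written as raw UTF-8 bytes; the é occupies two bytes (0xC3 0xA9).
CAFE = "caf\xC3\xA9"

# Hypothetical helper: return the first `count` characters by decoding
# the string to Unicode code points and re-encoding the ones we keep.
def first_chars(str, count)
  str.unpack('U*').first(count).pack('U*')
end

puts CAFE.unpack('C*').length      # 5 bytes
puts CAFE.unpack('U*').length      # 4 characters
puts first_chars(CAFE, 4) == CAFE  # true: the é survives intact
```

Under Ruby 1.8, indexing the string directly counts bytes, so asking for the first four "characters" would cut the é in half. Working on code points, as the helper does here, is the sort of thing a multibyte-aware wrapper spares you from writing by hand.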
Apparently there’s active work going on in the core to get internal Rails stuff
to use this new functionality, so hopefully it should be pretty good soon.
Hopefully it should be good enough for me to use just now…