POST requests and character encoding

Friday 22 January, 2010 @ 14:38

While trying to eliminate all character encoding problems in my Rails application, I stumbled upon the problem of POST requests and their encoding. The problem with these requests is that, when a very basic HTML form is submitted, some browsers do not indicate the character encoding of the data in the request at all. I tested this on Firefox 3.6.

Most of the info I could find on this simply claims that the encoding of the POST request is the same as that of the page that contained the submitted form. Therefore, if you serve pages as UTF-8, any forms that are submitted back to you will also be in UTF-8.

That may be true, but it doesn’t really help if you’re an idealist, such as yours truly, who wants to treat an HTTP request as the stateless request it really is.

Looking around in the specs, I found two methods a browser can use to indicate the character encoding in a POST request:

  • By specifying it in the Content-Type header, such as “application/x-www-form-urlencoded; charset=UTF-8”. There’s a Mozilla bug from back in 1999 in which this was discussed. In the end, they did not adopt this method, because it caused breakage with several HTTP server implementations at the time.

  • For forms that use the application/x-www-form-urlencoded encoding (most forms that don’t do file uploads), a hidden field named ‘_charset_’ can be included. Browsers will override its value on submission with the encoding used. This will be in HTML5, and you can find it in the current draft.
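The two methods above can be sketched in plain Ruby. This is only an illustration, not the monkey-patch mentioned further down: detect_form_encoding is a hypothetical helper name, and it assumes you have the raw Content-Type header value and the raw urlencoded body at hand.

```ruby
require "uri"

# Hypothetical helper, not part of Rails or Rack: work out which encoding
# a urlencoded POST body claims to be in. Returns an Encoding, or nil when
# the request does not say (or names an encoding Ruby does not know).
def detect_form_encoding(content_type, raw_body)
  name = nil

  # Method 1: a charset parameter on the Content-Type header.
  if content_type && (match = content_type.match(/;\s*charset="?([^";\s]+)"?/i))
    name = match[1]
  end

  # Method 2: the _charset_ field, filled in by the browser on submission.
  if name.nil?
    pair = URI.decode_www_form(raw_body).find { |key, _| key == "_charset_" }
    name = pair[1] if pair && !pair[1].empty?
  end

  name && Encoding.find(name)
rescue ArgumentError
  # Unknown encoding name or malformed body: treat the encoding as unknown.
  nil
end

from_header = detect_form_encoding(
  "application/x-www-form-urlencoded; charset=UTF-8", "name=x"
)
from_field = detect_form_encoding(
  "application/x-www-form-urlencoded", "name=x&_charset_=ISO-8859-1"
)
```

The header is checked first here, but that ordering is my own choice; nothing above says which of the two should win when both are present.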

Neither of these methods is handled by Rails or Rack on Ruby 1.9, and all you get is strings with their #encoding set to US-ASCII, while the bytes they contain are actually UTF-8. A nice contradiction, and a source of exceptions elsewhere, deep in your application.
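To make that contradiction concrete, here is a minimal sketch using only core String methods. The helper name relabel_as_utf8 is made up for this example, and it assumes the bytes really are UTF-8.

```ruby
# The UTF-8 bytes of "Shtééf", labelled US-ASCII as described above.
raw = "Sht\xC3\xA9\xC3\xA9f".b.force_encoding(Encoding::US_ASCII)

# Hypothetical helper: relabel the bytes as UTF-8 without changing them,
# and refuse input that is not actually valid UTF-8.
def relabel_as_utf8(str)
  copy = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid UTF-8" unless copy.valid_encoding?
  copy
end

# Mixing the mislabelled string with real UTF-8 text is what blows up.
begin
  raw + "\u00E9"
  clash = nil
rescue Encoding::CompatibilityError => e
  clash = e.message
end

fixed = relabel_as_utf8(raw)
```

Note that force_encoding only changes the label on the string; the bytes stay the same. That is all that is needed here, since the data was UTF-8 all along.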

I set out to get this sorted in my app, and wrote a monkey-patch. The patch automatically adds the hidden field when using FormHelper, and tries to deal with both that field and the Content-Type header in requests. It’s been briefly tested in Firefox 3.6 only. You can find it in a gist on GitHub.

Apache locale trickery in Ubuntu

Friday 22 January, 2010 @ 10:31

The default Apache install in Ubuntu, and probably Debian too, contains a config file /etc/apache2/envvars which I have consistently ignored. I can’t remember ever having to deal with environment variables in a web application before.

But now I had to, and not realizing it, I spent a good hour fighting a vague problem from various angles, before I finally made the breakthrough.

This config file contains a line “LANG=C” by default. One of its many consequences is that Ruby 1.9 takes its default external encoding from the locale, so all file operations expect files to be in US-ASCII, while the rest of the system operates in UTF-8.
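The effect can be reproduced without touching the environment by passing the external encoding explicitly. A minimal sketch: read_back is a hypothetical helper, and the explicit "US-ASCII" stands in for what Ruby picks up from LANG=C (you can check the real value with Encoding.default_external).

```ruby
require "tempfile"

# Hypothetical helper: write raw bytes to a temporary file and read them
# back with the given external encoding, independent of the current locale.
def read_back(bytes, external_encoding)
  result = nil
  Tempfile.create("view") do |file|
    file.binmode
    file.write(bytes)
    file.flush
    result = File.read(file.path, encoding: external_encoding)
  end
  result
end

utf8_bytes = "caf\u00E9".b # the UTF-8 bytes of "café"

as_ascii = read_back(utf8_bytes, "US-ASCII") # roughly what LANG=C gives you
as_utf8  = read_back(utf8_bytes, "UTF-8")    # what a UTF-8 locale gives you

as_ascii.valid_encoding? # false: same bytes, wrong label
as_utf8.valid_encoding?  # true
```

Both reads return exactly the same bytes; only the encoding label differs, which is why the problem tends to surface later, when the string is combined with properly labelled UTF-8 text.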

I ran into this with a Rails application hosted in Apache with Passenger, and a view containing non-ASCII characters. Ruby’s errors when it encounters incompatible encodings are… very terse.

But I imagine that particular line is in that particular config file because someone else banged his or her head against a wall for a good hour too, because LANG wasn’t C.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
(c) 2010 Shtééf