Getting PO files to work with UTF-8 and Poedit

Wow. That was hard work.
For four hours, I’ve been fighting to transfer my data out of my perfectly happy MySQL translation table, and into a PO file for Django to consume.
Problems are numerous. For a start, I must output to UTF-8. That’s fine.
But line breaks need to be replaced by chr(11), the vertical tab character. No, seriously.
Of course, you’ll remember to only escape the double-quotes, not the single-quotes.
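For comparison, here is a minimal Python sketch of the escaping step. It assumes the standard gettext convention of C-style escapes inside PO strings (backslash, double quote and newline become `\\`, `\"` and `\n`), which is not the chr(11) replacement I ended up using. The names `po_escape` and `po_entry` are made up for illustration.

```python
def po_escape(text):
    """Escape a string for use inside a double-quoted PO string."""
    # Backslashes go first, so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    # Only double quotes need escaping; single quotes are left alone.
    text = text.replace('"', '\\"')
    # Drop carriage returns, then turn real newlines into the two characters \n.
    text = text.replace("\r", "").replace("\n", "\\n")
    return text


def po_entry(msgid, msgstr):
    """Format one msgid/msgstr pair, followed by a blank separator line."""
    return 'msgid "%s"\nmsgstr "%s"\n\n' % (po_escape(msgid), po_escape(msgstr))
```

Calling `po_entry("greeting", "it's fine")` leaves the single quote untouched, which matches the point above.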
Then you need the header:

msgid ""
msgstr ""
"Content-Type: text/plain; charset=utf-8\n"

If you have this header, and you have a single error in your UTF-8, it won’t load in Poedit. You’ll see the error “Failed to convert file contents to Unicode.”
Wow, thanks.
So, thinking it was my corrupted Japanese file, I hunted for ages for this in a hex editor.
Actually, I really did have some invalid UTF-8 characters.
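If you would rather not hunt through a hex editor, a few lines of Python can point at the first bad byte. This is only a sketch, and `first_invalid_utf8_offset` is a made-up helper name; it relies on the `start` attribute that Python's `UnicodeDecodeError` carries.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None
```

Read the PO file in binary mode (`open(path, "rb").read()`) and pass the bytes in; the returned offset is where to look in the hex editor.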
Trick number 1 (on Windows): cut the whole text out and paste it back in. This will fix it.
The real fix is to run it through a routine like this:

function make_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    $result = preg_match_all('%(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*%xs', $string, $matches);
    return join($matches[0]);
}

This will drop the dead characters.
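The same clean-up can be sketched in Python with the built-in decoder. This is not a port of the regex above: the name `drop_invalid_utf8` is made up, and unlike the PHP routine it keeps ASCII control characters other than tab, LF and CR, because those are still valid UTF-8.

```python
def drop_invalid_utf8(data):
    """Return the input bytes with anything that is not valid UTF-8 removed."""
    # errors="ignore" silently skips undecodable bytes instead of raising.
    return data.decode("utf-8", errors="ignore").encode("utf-8")
```

Valid multi-byte text such as Japanese passes through unchanged; stray bytes, overlong forms and truncated sequences are dropped.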
Aaaand we’re done. Four hours later. Here is the export script I ended up with:


$locale = "en";
$dir = "/foo/bar/" . $locale . "/LC_MESSAGES/";
if (!is_dir($dir)) {
    mkdir($dir, 0777, true); // recursive, so missing parent directories are created too
}
$handle = fopen($dir . "django.po", "wb");
$qry = mysql_query("SELECT * FROM snippet");
fwrite($handle, "\xEF\xBB\xBF"); // This is the BOM - not needed, but often handy
// Header entry: empty msgid, with the charset declared in the msgstr
fwrite($handle, "\nmsgid \"\"\nmsgstr \"\"\n\"Content-Type: text/plain; charset=utf-8\\n\"");
while ($row = mysql_fetch_array($qry)) {
    // Drop invalid UTF-8, escape double quotes, swap \n for chr(11), strip \r
    $content = str_replace("\r", '', str_replace("\n", chr(11), str_replace('"', '\"', make_utf8($row[$locale]))));
    fwrite($handle, "\nmsgid \"" . $row['name'] . "\"\nmsgstr \"" . $content . "\"\n");
}
fclose($handle);

Multiple domains, multiple languages on GAE and Django

I’m using the very excellent app-engine-patch to run Django on Google App Engine.

One requirement is that my sites are multi-lingual.

www.french.com is a French site with its own design.

www.english.com has a different language and a different look.

The app-engine-patch provides ragendja’s dynamicsite module – very handy.
This will let me match domain names to sites very easily.
For the language, I have simply amended the module in place. I really should override it with something pluggable instead; that is left as an exercise for the reader.
I created a mapping table:

class SiteLanguage(db.Model):
    language = db.StringProperty()
    site = db.ReferenceProperty(Site)

    def __unicode__(self):
        # unicode() rather than str(), so non-ASCII site names don't blow up
        return self.language + u'-' + unicode(self.site)

Then I opened dynamicsite.py, and added the language setter at line 28ish:

# set language appropriately
lang = SiteLanguage.gql('WHERE site = :1', site).get()
if lang:
    translation.activate(lang.language)
    request.LANGUAGE_CODE = translation.get_language()

with the imports too of course:

from django.utils import translation
from xxx.models import SiteLanguage

and popped it in my admin screens so I can assign sites to languages:

class SiteLanguageAdmin(admin.ModelAdmin):
    list_display = ('site', 'language',)

admin.site.register(SiteLanguage, SiteLanguageAdmin)
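As a starting point for the pluggable version, the host-to-language lookup can be pulled out into a helper that does not depend on Django at all. This is a rough sketch under my own assumptions: `language_for_host` and its `default` fallback are made-up names, not part of ragendja, and a plain dict stands in for the SiteLanguage table.

```python
def language_for_host(host, site_languages, default="en"):
    """Pick a language code for a request host.

    site_languages is a plain dict standing in for the SiteLanguage table,
    e.g. {"www.french.com": "fr"}.
    """
    # Strip an optional port and normalise case before the lookup.
    hostname = host.split(":", 1)[0].lower()
    return site_languages.get(hostname, default)
```

The result would then be handed to translation.activate() in the same place as the snippet above.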

Done! 🙂