Internationalization Cookbook

Mojibake, question marks, and other troubles

Q. I am trying to create a Japanese application and I get corrupted characters/question marks. What am I doing wrong?

A. There are many elements affecting this, but the most important are described below.

Introduction

One of the most common questions asked in newsgroups is the one above.

Usually the answer points to one of the following elements: the OS version, the OS system code page (or “ANSI code page”), whether the application is Unicode or ANSI, how the resource file was encoded and compiled, and the fonts.

But figuring out exactly what is wrong might require a lot of back and forth, with successive questions and answers, to narrow down the exact problem.

This is where this article tries to help.

The Application

I am not going to talk much about it. You can download the code and see for yourself.

It is the basic Win32 application as created by the Visual Studio 2005 wizard, adapted to load its resources from a DLL (whose name is provided on the command line).

Test application

The UI tries to cover the most common elements containing text:

  • one menu
  • one message box
  • 3 lines of text (displayed in WM_PAINT using DrawText) with 3 different fonts (DEFAULT_GUI_FONT, “MS Gothic” and “MS ゴシック”)
  • 6 dialogs with 6 fonts in different classes: 2 Western fonts (“Arial” and “Comic Sans MS”), one Japanese font referred to by its English and Japanese names (“MS Gothic” and “MS ゴシック”), and two generic fonts (“MS Shell Dlg” and “MS Shell Dlg 2”).

The resource files use Windows 932 encoding, and I have added 2 Kanji characters (日本 == Nihon == Japan) to all text elements.

The bytes of the 2 Kanji (in cp 932) are “93 FA 96 7B”; interpreted as Windows 1252 (Western European) encoding, they look like this: “ú–{.

The Unicode values are U+65E5 U+672C.
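The byte values above are easy to verify. Here is a minimal sketch in Python (not part of the test application; it is used only because its `cp932` and `cp1252` codecs make the arithmetic visible):

```python
# The two Kanji used in every text element of the test application.
text = "日本"

# Encode with Windows code page 932 (the Windows flavor of Shift-JIS).
raw = text.encode("cp932")
hex_bytes = " ".join(f"{b:02X}" for b in raw)

# Read the very same bytes as Windows 1252 (Western European): mojibake.
mojibake = raw.decode("cp1252")

# Unicode code points of the original characters.
code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)

print(hex_bytes)    # 93 FA 96 7B
print(mojibake)     # “ú–{
print(code_points)  # U+65E5 U+672C
```

Nothing is lost in this kind of mojibake: the bytes are intact and only the interpretation is wrong, which is why it appears as consistent “funny” characters rather than question marks.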

The Operating Systems

I have run the tests on Windows 98 SE (Second Edition) English and Japanese, and on Windows 2000 SP4, Windows XP SP2, and Windows Vista CTP July 2006.

For the Unicode platforms (2000, XP and Vista) all international support was installed (complex script and double byte), and the testing on “Japanese OS” was in fact done on an English OS, with the system locale set to Japanese, followed by a reboot. I assure you that running this application on the matching “real” Japanese systems will give the same results :-).

The Results

The results in the table are not to be used as something “carved in stone”, but more to illustrate the qualitative results. They will depend on the actual combination of bytes used to represent your Kanji.

The “pragma code_page” column means that the .rc file has the #pragma code_page directive, with the value indicated (duh :-)).

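OS | Application Type | pragma code_page | Menu | Txt DEFAULT GUI FONT | Txt MS Gothic | Txt MS ゴシック | MessageBox | Dlg Arial | Dlg Comic Sans MS | Dlg MS Gothic | Dlg MS ゴシック | Dlg MS Shell Dlg | Dlg MS Shell Dlg 2
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
WXP & Vista US | ANSI | 1252 | “ú–{ | “ú–{ | 日本 | 日本 | “ú–{ | “ú–{ | “ú–{ | “ú–{ | ❙ú❙{ | “ú–{ | “ú–{
WXP & Vista US | ANSI | 932 | 日本 | ?? | ?? | ?? | ?? | ?? | ?? | ?? | ?? | ?? | ??
WXP & Vista US | Unicode | 1252 | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | ❙ú❙{ | “ú–{ | “ú–{
WXP & Vista US | Unicode | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本
WXP & Vista Jp | ANSI | 1252 | “ú–{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{
WXP & Vista Jp | ANSI | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本
WXP & Vista Jp | Unicode | 1252 | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{
WXP & Vista Jp | Unicode | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本 | 日本
W2K US | ANSI | 1252 | “ú–{ | “ú–{ | 日本 | 日本 | “ú–{ | “ú–{ | “ú–{ | “ú–{ | ❙ú❙{ | “ú–{ | “ú–{
W2K US | ANSI | 932 | 日本 | ?? | ?? | ?? | ?? | ?? | ?? | ?? | ?? | ?? | ??
W2K US | Unicode | 1252 | 日本 | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | ❙ú❙{ | “ú–{ | “ú–{
W2K US | Unicode | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | ❙❙ | ❙❙ | 日本 | 日本 | 日本 | 日本
W2K Jp | ANSI | 1252 | “ú–{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{ | “u?{
W2K Jp | ANSI | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | ❙❙ | ❙❙ | 日本 | 日本 | 日本 | 日本
W2K Jp | Unicode | 1252 | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{ | “ú–{
W2K Jp | Unicode | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | ❙❙ | ❙❙ | 日本 | 日本 | 日本 | 日本
W98 US | ANSI | 1252 | ❙ú❙{ | ❙ú❙{ | “ú–{ | “ú–{ | ❙ú❙{ | “ú–{ | “ú–{ | ❙ú❙{ | ❙ú❙{ | ❙ú❙{ | ❙ú❙{
W98 US | ANSI | 932 | _ _ | _ _ | _ _ | _ _ | _ _ | _ _ | _ _ | _ _ | _ _ | _ _ | _ _
W98 US | Unicode | 1252 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a
W98 US | Unicode | 932 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a
W98 Jp | ANSI | 1252 | “_ _ _{ | “_ _ _{ | “_ _ _{ | “_ _ _{ | “_ _ _{ | ⌷g _ _ _{ | ⌷g _ _ _{ | “_ _ _{ | “_ _ _{ | ❙g _ _ _{ | “_ _ _{
W98 Jp | ANSI | 932 | 日本 | 日本 | 日本 | 日本 | 日本 | “ú–{ | “ú–{ | 日本 | 日本 | ❙ú❙{ | 日本
W98 Jp | Unicode | 1252 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a
W98 Jp | Unicode | 932 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a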

For “funny” characters (missing glyphs and such) I have tried to select characters with glyphs resembling the visual aspect of the result. Do not read more into this than there is. For instance, I have used U+2759 to show a black rectangle that looks like the “missing glyph” of a bitmap font.

Just in case your fonts are very different, I have also included some screenshots:

Correct: 日本
Ok, only bad font: ❙❙
Mojibake: ⌷g _ _ _{
Mojibake: ❙ú❙{
Mojibake: “u?{
Mojibake: “ú–{
Mojibake: “_ _ _{
Runtime cvt: _ _
Runtime cvt: ??
Not tested: n/a

I have not created Unicode versions of the application to run on Windows 98 (using MSLU). I am quite sure they would have the same limitations as the ANSI versions running on Windows 98. Maybe at some point I will do it, but I doubt it :-).

Some Observations

  • using #pragma code_page(932) is vital. Alternatively, you can use /c on the command line of rc.exe. If both the pragma and /c are present, the pragma directive overrides the command line.
  • an ANSI application only works if the language of the localization matches the code page of the host system (i.e. Japanese on a Japanese OS, Korean on a Korean OS). You can mix and match applications and OSes that use the same code page (e.g. Romanian on a Polish system, both using 1250).
  • using “generic fonts” on old Windows versions (9x, Me) is not quite safe.
  • the menu on Unicode systems is correct way more often than other elements (There is a good reason for this. If you cannot figure it out, ask, and I might come up with another article :-)).
  • compare “W2K+Unicode Application+#pragma code_page(932)” with “WXP+Unicode Application+#pragma code_page(932)” to see the improvements in font linking between 2000 and XP.
  • in this area there are no major changes in Vista vs. XP.
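Most of the ANSI rows for 2000, XP and Vista can be approximated with a two-step model: the resource compiler decodes the .rc bytes using the declared code page and stores Unicode in the compiled resource, and an ANSI application then has that Unicode text converted to the system code page at run time. The Python sketch below illustrates this under those assumptions. The function names `compile_resource` and `load_as_ansi` are hypothetical, and the model ignores fonts entirely. It also replaces every unrepresentable character with “?”, whereas Windows applies best-fit mappings too, so it will not reproduce results such as “u?{ exactly.

```python
# The .rc file content as saved on disk: two Kanji in code page 932.
RC_BYTES = "日本".encode("cp932")

def compile_resource(rc_bytes: bytes, pragma_code_page: str) -> str:
    """Model of the resource compiler: decode the source bytes with the
    code page named by the pragma. The returned str stands for the
    Unicode text stored in the compiled resource."""
    return rc_bytes.decode(pragma_code_page)

def load_as_ansi(resource_text: str, system_code_page: str) -> str:
    """Model of an ANSI application: the Unicode resource is converted to
    the system code page; characters that do not fit become '?'."""
    ansi_bytes = resource_text.encode(system_code_page, errors="replace")
    return ansi_bytes.decode(system_code_page)

# pragma 932 on a US system: the Kanji cannot be represented in 1252.
us_932 = load_as_ansi(compile_resource(RC_BYTES, "cp932"), "cp1252")

# pragma 1252 on a US system: wrong decoding, but it survives the round trip.
us_1252 = load_as_ansi(compile_resource(RC_BYTES, "cp1252"), "cp1252")

# pragma 932 on a Japanese system: everything matches.
jp_932 = load_as_ansi(compile_resource(RC_BYTES, "cp932"), "cp932")

# Romanian text (t-cedilla, a-breve) on a Polish system: both use 1250.
romanian = "\u0163ar\u0103"
pl_1250 = load_as_ansi(
    compile_resource(romanian.encode("cp1250"), "cp1250"), "cp1250"
)

print(us_932)               # ??
print(us_1252)              # “ú–{
print(jp_932)               # 日本
print(pl_1250 == romanian)  # True
```

A Unicode application skips the second step, which is why its rows depend only on the pragma and on the fonts.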
