Internationalization Cookbook

How SBCS/MBCS/Unicode applications interact with Windows

Q. Why will a non-Unicode localized application only run properly if the default system locale matches the application language?

A. Because of the “code page barrier.” Any string passed in an API call from the application is converted to Unicode using the default system code page.

2006.09.18: Small typo corrected, thanks Olivier Marcoux.

Historically, Windows 95, 98, and Me are the 32-bit versions of the OS that directly continue Windows 3.x. The kernel and the whole API are code-page oriented. The exposed API is “narrow” or “ANSI,” meaning that strings are passed as char* pointing to text encoded in the code page of the system.

Windows 2000, XP, 2003 (and Longhorn) are derived from the Windows NT code, which was developed independently; the kernel and the whole API are Unicode. The strings are wchar_t* (well, in fact WCHAR*, because at the time there was no wchar_t in the standard) and the exposed API is “wide” or “Unicode.”

For the curious, a WCHAR is a WORD, which is an unsigned short, so sizeof(WCHAR) is 2 (with some implications on the supported flavor of Unicode, which I will try to cover later, if someone is interested or if I feel like it :-)).
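As a small taste of those implications: assuming the strings are UTF-16 (which, as far as I know, is how Windows 2000 and later treat them), a character outside the first 65,536 code points does not fit in one 16-bit unit and takes two, a “surrogate pair.” Here is a minimal sketch of the arithmetic. The function name encode_utf16 is mine, not a Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-16.
   Writes 1 or 2 sixteen-bit units to out and returns how many were written.
   (encode_utf16 is a hypothetical helper, not part of any Windows header.) */
size_t encode_utf16(uint32_t code_point, uint16_t out[2])
{
    assert(code_point <= 0x10FFFF);                      /* highest Unicode code point */
    assert(code_point < 0xD800 || code_point > 0xDFFF);  /* surrogates are not characters */

    if (code_point < 0x10000) {
        out[0] = (uint16_t)code_point;                   /* fits in a single unit */
        return 1;
    }
    code_point -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (code_point >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (code_point & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

So U+00E9 (é) stays one unit, while U+1F600 becomes the pair 0xD83D 0xDE00, and code that assumes “one WCHAR is one character” miscounts it.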

To make developers' lives simpler and to help with the migration to Unicode, Microsoft offers a way to compile the same code as either ANSI or Unicode.

Every API that takes strings as input or output parameters is mapped to the wide or the narrow version, depending on whether UNICODE is defined.

In reality there is no API called MessageBox.

There is MessageBoxA(HWND, char*, char*, int) and MessageBoxW(HWND, wchar_t*, wchar_t*, int).

(Well, these are not really the signatures; I have expanded the #defines and used wchar_t to make it clear.)

Deep in one of the Windows headers (WinUser.h) you can find this piece of code (copyright Microsoft):

#ifdef UNICODE
#define MessageBox  MessageBoxW
#else
#define MessageBox  MessageBoxA
#endif // !UNICODE
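To see the same trick outside the Windows headers, here is a minimal, portable sketch. The names TextLengthA, TextLengthW, and generic_length_demo are made up for illustration; they are not Windows APIs, only functions named in the Windows style.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical API pair: one narrow ("A") and one wide ("W") function. */
size_t TextLengthA(const char *text)
{
    assert(text != NULL);
    return strlen(text);
}

size_t TextLengthW(const wchar_t *text)
{
    assert(text != NULL);
    return wcslen(text);
}

/* The "generic" name is only a macro; no function called TextLength exists. */
#ifdef UNICODE
#define TextLength TextLengthW
#else
#define TextLength TextLengthA
#endif

/* Calls the API through the generic name. */
size_t generic_length_demo(void)
{
#ifdef UNICODE
    return TextLength(L"Hello");  /* preprocessor turns this into TextLengthW(L"Hello") */
#else
    return TextLength("Hello");   /* preprocessor turns this into TextLengthA("Hello") */
#endif
}
```

The compiler never sees the name TextLength. By the time it runs, the preprocessor has already picked one of the two real functions, which is why a debugger or a linker error always shows the “A” or “W” name.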

Windows 9x was not Unicode, and the only flavor of API implemented was the “A” version. With some exceptions, the “W” API calls are only stubs that return an error. Unicode applications cannot run on Windows 9x.

The Unicode versions of Windows (NT and so on) are fully Unicode and implement the “W” versions. But, nicely enough, the “A” versions are not stubs, so ANSI applications can run on Windows NT, 2000, XP, etc., with some limitations.

The “A” calls on NT-based Windows convert the strings to Unicode USING THE DEFAULT SYSTEM CODE PAGE (not yelling, just trying to say that this is the important part :-)), then call the “W” version. If the function returns a string (like GetWindowText), the Unicode result of the “W” API is converted back to the default system code page and returned as the result of the “A” API.
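Here is a rough sketch of what such an “A” wrapper does, written as portable C so it can be tried anywhere. Everything in it is an assumption made for illustration: SetTitleA, SetTitleW, system_code_page, and last_title are invented names, and decode_byte is a toy that knows only ASCII plus the upper ranges of code pages 1252 and 1251. The real conversion is done by Windows (MultiByteToWideChar exposes it to applications).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for the default system code page setting. */
unsigned system_code_page = 1252;

/* The last title received by the "W" function, kept so the effect can be inspected. */
wchar_t last_title[64];

/* Toy decoder: ASCII, the Latin-1 range of code page 1252, and the
   Cyrillic range of code page 1251. Everything else becomes U+FFFD. */
static wchar_t decode_byte(unsigned code_page, unsigned char b)
{
    if (b < 0x80)
        return (wchar_t)b;
    if (code_page == 1252 && b >= 0xA0)
        return (wchar_t)b;                       /* 0xA0..0xFF map to U+00A0..U+00FF */
    if (code_page == 1251 && b >= 0xC0)
        return (wchar_t)(0x0410 + (b - 0xC0));   /* 0xC0..0xFF map to U+0410..U+044F */
    return (wchar_t)0xFFFD;                      /* not covered by this toy table */
}

/* Hypothetical "W" function: the only one that does real work. */
void SetTitleW(const wchar_t *title)
{
    assert(title != NULL);
    wcsncpy(last_title, title, 63);
    last_title[63] = L'\0';
}

/* Hypothetical "A" function: convert with the system code page, then call "W". */
void SetTitleA(const char *title)
{
    wchar_t wide[64];
    size_t i = 0;

    assert(title != NULL);
    for (; title[i] != '\0' && i < 63; ++i)
        wide[i] = decode_byte(system_code_page, (unsigned char)title[i]);
    wide[i] = L'\0';
    SetTitleW(wide);
}
```

Call SetTitleA("caf\xE9") with system_code_page set to 1252 and the last character arrives as U+00E9 (é). Set it to 1251 and the very same byte arrives as U+0439 (й). The application did nothing differently; only the system setting changed, and that is the whole “code page barrier.”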

When Windows 9x was almost dead, Michael Kaplan did the reverse for it, writing what is now MSLU (Microsoft Layer for Unicode). He basically replaced the “W” stubs in Windows 9x with functions that convert the Unicode strings to ANSI (again, using the default system code page) and call the “A” version of the API. If text is returned, it is converted back to Unicode.

The principle is simple, but the huge task of wrapping every API prevented anyone from doing it before him :-)

So your Unicode application calls the “W” API on Windows 95 and works. But because everything is converted to the ANSI code page, the Windows 95 limitations remain (not being able to input or display Japanese on US systems, and so on).
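The loss is easy to show with a toy version of that conversion. This is only a sketch under my own assumptions: encode_cp1252 and narrow_for_ansi_api are invented names, the table covers only ASCII and the Latin-1 range of code page 1252, and '?' is used as the replacement character (the usual default on Windows, as far as I know). The real work is done by the system (WideCharToMultiByte is the API an application would call).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Toy encoder for code page 1252: ASCII and the Latin-1 range map directly,
   anything else has no representation and becomes '?'. */
static char encode_cp1252(wchar_t ch)
{
    unsigned long c = (unsigned long)ch;

    if (c < 0x80)
        return (char)c;                     /* ASCII */
    if (c >= 0xA0 && c <= 0xFF)
        return (char)(unsigned char)c;      /* U+00A0..U+00FF map to bytes 0xA0..0xFF */
    return '?';                             /* no mapping in this code page */
}

/* Hypothetical wrapper step: narrow the text before handing it to an "A" function.
   Returns the number of bytes written, not counting the terminator. */
size_t narrow_for_ansi_api(const wchar_t *wide, char *out, size_t out_size)
{
    size_t i = 0;

    assert(wide != NULL && out != NULL && out_size > 0);
    for (; wide[i] != L'\0' && i + 1 < out_size; ++i)
        out[i] = encode_cp1252(wide[i]);
    out[i] = '\0';
    return i;
}
```

Feed it the three characters of 日本語 (U+65E5, U+672C, U+8A9E) and the “A” API receives "???". The text was perfectly good Unicode inside the application, but it did not survive the trip through the system code page.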

See the “Generic-Text Mappings in TCHAR.H” topic in the MSDN help to learn how to write applications that can be compiled both as ANSI and as Unicode.

Also, see “Globalization Step-by-Step” in the Microsoft “Global Development” section.

Moving to Unicode can be time-consuming if the original application was not written using generic-text mappings. But with the Unicode versions of Windows becoming prevalent, and the burden of supporting Windows 9x nearly gone, there is no technical reason not to do it.

Here I have only tried to explain why things are the way they are and why some things cannot be done without moving to Unicode.
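To give an idea of what those mappings look like, here is a tiny portable imitation. The real header defines TCHAR, _T(), and _tcslen (among many others) and switches on _UNICODE. The demo_tchar, DEMO_T, and demo_tcslen below are my own stand-ins so the sketch compiles without TCHAR.H, and count_char is just an example function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Stand-ins for what TCHAR.H provides; the demo_ names are hypothetical. */
#ifdef _UNICODE
typedef wchar_t demo_tchar;         /* plays the role of TCHAR   */
#define DEMO_T(s)    L##s           /* plays the role of _T()    */
#define demo_tcslen  wcslen         /* plays the role of _tcslen */
#else
typedef char demo_tchar;
#define DEMO_T(s)    s
#define demo_tcslen  strlen
#endif

/* Counts how many times ch occurs in text.
   The source is identical for the narrow and the wide build. */
size_t count_char(const demo_tchar *text, demo_tchar ch)
{
    size_t count = 0;
    size_t length;
    size_t i;

    assert(text != NULL);
    length = demo_tcslen(text);
    for (i = 0; i < length; ++i)
        if (text[i] == ch)
            ++count;
    return count;
}
```

A call such as count_char(DEMO_T("Hello, world"), DEMO_T('l')) is written once and compiles either way. Code that hard-codes char, plain string literals, or strlen is exactly what makes a later move to Unicode expensive.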

I will try to cover the “how” in other articles.
