Internationalization Cookbook
This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

What is a “Unicode application”?

Q. What is a “Unicode application”?

A. There is no such thing, grasshopper!

Intro

There are some blogs I read daily (unless something explodes). One of them is Raymond Chen’s The Old New Thing. Another is Michael Kaplan’s Sorting It All Out.

But yesterday Raymond published something that is in between the two: On the fuzzy definition of a "Unicode application".

It is a very interesting problem, and I might try to give an opinion at the end of this article, but that is not the main topic here.

The comments shifted the topic a bit, and the main debate was about whether the IME should care about the application as a whole or only about the control. On one side were Norman Diamond, Ben Bryant and myself, arguing that it is not such a big problem; on the other were Raymond and Michael, firmly convinced that “if you think it’s easy then you don’t understand the problem.”

In such cases I normally go away convinced I am right and call it a day :-)
But when on the other side is someone that I respect, I start to doubt.

So I did what I always do in this kind of situation: I wrote a small application, because nothing makes things clearer. It is not about the English language anymore, and this is the ultimate test. Sometimes you think that something is very easy (or very difficult), and only when you try doing it do you discover whether that is really true.

First, plain ANSI

First I wrote an ANSI application, directly calling the A versions of all APIs and ostentatiously using char* throughout.
And because I wanted to be in control of what is created and how, I don’t have dialogs and nothing comes from resources. No error checking, nothing getting in the way of understanding what is going on.

And here is the code:

#include <windows.h>
#include <stdio.h>
#include <string.h>

LRESULT CALLBACK    WndProc( HWND, UINT, WPARAM, LPARAM );

int APIENTRY WinMain( HINSTANCE hInstance, HINSTANCE hPrevInstance, char * lpCmdLine, int nCmdShow )
{
    MSG msg;
    const char *szWindowClass = "MiniApp";

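    // Use the explicit A structure so it matches RegisterClassExA below,
    // even if UNICODE happens to be defined for the build.
    WNDCLASSEXA wcex;

    wcex.cbSize        = sizeof(WNDCLASSEXA);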
    wcex.style         = CS_HREDRAW | CS_VREDRAW;
    wcex.lpfnWndProc   = (WNDPROC)WndProc;
    wcex.cbClsExtra    = 0;
    wcex.cbWndExtra    = 0;
    wcex.hInstance     = hInstance;
    wcex.hIcon         = NULL;
    wcex.hCursor       = LoadCursor( NULL, IDC_ARROW);
    wcex.hbrBackground = (HBRUSH)(COLOR_WINDOW+1);
    wcex.lpszMenuName  = NULL;
    wcex.lpszClassName = szWindowClass;
    wcex.hIconSm       = NULL;

    RegisterClassExA( &wcex );

    HWND hWnd = CreateWindowA( szWindowClass, "Minimal app", WS_OVERLAPPEDWINDOW, CW_USEDEFAULT, 0, CW_USEDEFAULT, 0, NULL, NULL, hInstance, NULL );

    if( !hWnd )
        return FALSE;

    ShowWindow( hWnd, nCmdShow );
    UpdateWindow( hWnd );

    // GetMessage returns -1 on error, so test for > 0 rather than non-zero.
    while( GetMessageA( &msg, NULL, 0, 0 ) > 0 ) {
        TranslateMessage( &msg );
        DispatchMessageA( &msg );
    }

    return 0;
}

#define DIM(a)    (sizeof(a)/sizeof(a[0]))

LRESULT CALLBACK WndProc( HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam ) {
    PAINTSTRUCT ps;
    static HWND hWndEdit;
    static char outbuffer[1024] = {0};
    static char inbuffer[1024] = {0};
    char tmpbuffer[1024] = {0};

    switch( message ) {
        case WM_CREATE:
            hWndEdit = CreateWindowA( "EDIT", "", WS_CHILD | WS_VISIBLE | WS_BORDER, 10, 10, 100, 25, hWnd, NULL, NULL, NULL );
            break;
        case WM_COMMAND:
            GetWindowTextA( hWndEdit, inbuffer, DIM(inbuffer) );
            outbuffer[0] = 0;
            for( int i = 0; inbuffer[i]; i++ ) {
                _snprintf( tmpbuffer, DIM(tmpbuffer), "0x%x ", 0xFF & inbuffer[i] );
                strcat( outbuffer, tmpbuffer );
            }
            InvalidateRect( hWnd, NULL, TRUE );
            break;
        case WM_PAINT:
            BeginPaint( hWnd, &ps );
            TextOutA( ps.hdc, 10, 40, inbuffer, (int)strlen(inbuffer) );
            TextOutA( ps.hdc, 10, 60, outbuffer, (int)strlen(outbuffer) );
            EndPaint( hWnd, &ps );
            break;
        case WM_DESTROY:
            PostQuitMessage( 0 );
            break;
        default:
            return DefWindowProcA( hWnd, message, wParam, lParam );
    }

    return 0;
}
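
The 0xFF mask in the dump loop matters more than it looks. As a minimal sketch (hex_dump_bytes is a hypothetical helper, not part of the application above), the same loop can be tried in portable C++ without any window:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not in the original source): formats each byte of a
// narrow string the same way the WM_COMMAND handler does, as "0x%x ".
std::string hex_dump_bytes( const char *text ) {
    std::string out;
    char tmp[16];
    for( int i = 0; text[i]; i++ ) {
        // Masking with 0xFF keeps only the low 8 bits. Where char is signed,
        // a byte such as 0x81 is negative and would otherwise be printed
        // sign-extended (for example as 0xffffff81).
        std::snprintf( tmp, sizeof(tmp), "0x%x ",
                       static_cast<unsigned>( 0xFF & text[i] ) );
        out += tmp;
    }
    return out;
}
```

Feeding it the two bytes 0x81 0x8F gives "0x81 0x8f ", and a lone backslash gives "0x5c ". Using std::string instead of a fixed buffer is my choice for the sketch; it sidesteps the question of how large outbuffer needs to be.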

Now we run this on an English OS and play with Japanese a bit. Remember, the codes you see below are Shift-JIS, not Unicode.

The results are as expected: the text looks fine in the IME candidates list, but once committed, we get question marks (because the control is not Unicode).

I also booted a Japanese system, and there I can input the controversial 0x5C (the backslash/narrow Yen) and the wide version (which in Shift-JIS is 81 8F, no surprise). And don’t be fooled: 0xA5 here is a Shift-JIS value, which maps to Unicode U+FF65 (HALFWIDTH KATAKANA MIDDLE DOT). There is no mapping for the “real” narrow Yen (U+00A5) in Shift-JIS.
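
To make the byte values above less magical, here is a minimal sketch of how Shift-JIS bytes can be classified. The function names are hypothetical, and the lead-byte ranges assume the Windows code page 932 flavour of Shift-JIS:

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of the test application.

// True if the byte starts a two-byte Shift-JIS character
// (ranges assume Windows code page 932).
bool sjis_is_lead_byte( unsigned char b ) {
    return ( b >= 0x81 && b <= 0x9F ) || ( b >= 0xE0 && b <= 0xFC );
}

// True if the byte is a single-byte halfwidth katakana character.
bool sjis_is_halfwidth_katakana( unsigned char b ) {
    return b >= 0xA1 && b <= 0xDF;
}

// Maps a halfwidth katakana byte to its Unicode code point
// (0xA1..0xDF -> U+FF61..U+FF9F). Returns 0 for any other byte.
uint32_t sjis_halfwidth_to_unicode( unsigned char b ) {
    return sjis_is_halfwidth_katakana( b ) ? 0xFEC0u + b : 0;
}
```

With these, 0x81 is a lead byte (so 81 8F is one wide character), 0x5C is a plain single byte, and 0xA5 lands on U+FF65 rather than on U+00A5.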

Mixing W and A

Then I made the minimal changes necessary to create the Edit control as a Unicode control. The window class registration and the message pump are unchanged; only the body of WndProc differs:

LRESULT CALLBACK WndProc( HWND hWnd, UINT message, WPARAM wParam, LPARAM lParam ) {
    PAINTSTRUCT ps;
    static HWND hWndEdit;
    static WCHAR outbuffer[1024] = {0};
    static WCHAR inbuffer[1024] = {0};
    WCHAR tmpbuffer[1024] = {0};

    switch( message ) {
        case WM_CREATE:
            hWndEdit = CreateWindowW( L"EDIT", L"", WS_CHILD | WS_VISIBLE | WS_BORDER, 10, 10, 100, 25, hWnd, NULL, NULL, NULL );
            break;
        case WM_COMMAND:
            GetWindowTextW( hWndEdit, inbuffer, DIM(inbuffer) );
            outbuffer[0] = 0;
            for( int i = 0; inbuffer[i]; i++ ) {
                _snwprintf( tmpbuffer, DIM(tmpbuffer), L"0x%x ", 0xFFFF & inbuffer[i] );
                wcscat( outbuffer, tmpbuffer );
            }
            InvalidateRect( hWnd, NULL, TRUE );
            break;
        case WM_PAINT:
            BeginPaint( hWnd, &ps );
            TextOutW( ps.hdc, 10, 40, inbuffer, (int)wcslen(inbuffer) );
            TextOutW( ps.hdc, 10, 60, outbuffer, (int)wcslen(outbuffer) );
            EndPaint( hWnd, &ps );
            break;
        case WM_DESTROY:
            PostQuitMessage( 0 );
            break;
        default:
            return DefWindowProcA( hWnd, message, wParam, lParam );
    }

    return 0;
}

We do the same minimal testing and we get (again) the expected results:

This time the committed text is also OK, and the codes below are Unicode (UTF-16 code units).

Let’s try a character that needs a surrogate pair.

Works! Cool!
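
For a character like this the dump shows two 16-bit values, not one. A minimal sketch of how such a pair combines into a single code point (the helper names are mine, purely for illustration):

```cpp
#include <cstdint>

// Hypothetical helpers; not part of the test application.
bool is_high_surrogate( uint16_t u ) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate( uint16_t u )  { return u >= 0xDC00 && u <= 0xDFFF; }

// Combines a high/low surrogate pair into one code point.
// Returns 0 if the two values do not form a valid pair.
uint32_t combine_surrogates( uint16_t high, uint16_t low ) {
    if( !is_high_surrogate( high ) || !is_low_surrogate( low ) )
        return 0;
    return 0x10000u
         + ( ( static_cast<uint32_t>( high ) - 0xD800u ) << 10 )
         + ( static_cast<uint32_t>( low ) - 0xDC00u );
}
```

For example, the pair D842 DF9F combines to U+20B9F. Returning 0 for an invalid pair is just a convenience for the sketch; real code would probably want a proper error path.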

And now the “controversial” Yen on an English system:

and on a Japanese one:

So, except for the way U+005C is displayed (which is a glyph thing), all is right, including U+00A5.
Yes, there is no way to get it from the IME, but that is a problem of the IME table (or of a key missing from the keyboard), and has nothing to do with whether the control or the application is Unicode.
It has to do with these posts:

  • Getting rid of your extra yen
  • I WON to talk about the YEN
  • I’d rather call it the path separator
  • When is a backslash not a backslash?
  • The mission of GIFT :-)

Some thoughts

So, what have I learned (some kind of conclusions :-)):

  • The Yen problem is not simple, as Michael and Raymond pointed out
  • The Yen problem has nothing to do with whether the application is Unicode
  • The IME has no clue about U+00A5, whether the application is Unicode or not
  • The IME deals quite ok with Unicode controls, no matter if the rest of the application is Unicode or not (whatever that means :-))

So, getting the IME to work in a mixed environment is easy for the regular programmer. It might not be for the one implementing the OS (and then Michael and Raymond are right, it is difficult). But this is the point: they work hard so that we don’t have to :-).

Back to the “Unicode application” thing

One can say that an application making no calls to the A versions of the Windows API is Unicode. But then, what about CRT functions? OK, then no A APIs and no CRT functions taking char and char * (wide versions are fair game). Then what about a buffer of char used to import/export plain text files, or to send emails? Does this make my application non-Unicode? Or what about my own functions doing bad string manipulation, like casting from char to WCHAR? And when everything uses char and char * and the encoding is UTF-8 (UNIX style), is the application Unicode?
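
To illustrate the UTF-8 case, here is a minimal sketch of a char-only encoder. utf8_encode is a hypothetical helper, and it assumes the input is a valid code point (it does not reject surrogates or values above U+10FFFF):

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point as UTF-8 into a plain
// std::string of char. Assumes cp is a valid Unicode scalar value;
// no validation is done in this sketch.
std::string utf8_encode( uint32_t cp ) {
    std::string out;
    if( cp < 0x80 ) {                 // 1 byte:  0xxxxxxx
        out += static_cast<char>( cp );
    } else if( cp < 0x800 ) {         // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>( 0xC0 | ( cp >> 6 ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    } else if( cp < 0x10000 ) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>( 0xE0 | ( cp >> 12 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    } else {                          // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>( 0xF0 | ( cp >> 18 ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += static_cast<char>( 0x80 | ( cp & 0x3F ) );
    }
    return out;
}
```

Here U+005C stays the single byte 5C, U+00A5 becomes C2 A5, and the wide Yen U+FFE5 becomes EF BF A5. There is not a WCHAR in sight, and yet both Yen signs and the backslash remain distinct, which is more than the ANSI version of the test application can say.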

In the end, I don’t think we can have a clear definition for “Unicode application.” It is about shades of gray. What is a tall man? I don’t know, it depends. Is tall above 6′? Above 5′11″?

Now, don’t use this as an excuse and show it to your boss: “See, we don’t have to migrate our application to Unicode” :-).
You should be “Unicode enough” (and what that means can be the topic of another debate :-)).

The code

Although it is not very useful, you can download the sources from here.

One comment to “What is a “Unicode application”?”

  1. bright light says:

    I had the same issue in .net pinvoke . I was calling SetWindowLong instead of SetWindowLongW for textbox. Thanks! this fixed the issue.
