Internationalization Cookbook
This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

ToUnicode – Automating some of the steps of Unicode code conversion (Windows)

Story

Windows 95, 98 and Me are (mostly) gone. Yes, I know they are still lurking in some places, but development for them has (mostly) stopped.

But the ghosts are still here to haunt users and developers: hundreds of thousands of non-Unicode applications.

And very often questions show up in newsgroups about supporting all kinds of languages on non-matching operating systems (Japanese on English machines, etc.). And when the only solution is “move to Unicode” the answer is usually “the application is way too big, we cannot afford it.”

It looks like the price of converting a lot of code to Unicode is too high. But the price of maintaining non-Unicode applications in a Unicode world, with all kinds of hacks to enable some kind of limping support, is higher.

It looks difficult only because many people don’t know what it implies. Thing is, the conversion process is relatively simple, with big chunks of it being easy to automate.

How to start

Although the tool automates some of the steps, you still have to understand what is going on, because there is enough work left for you too.

Start by reading Michael Kaplan’s “Converting a project to Unicode” (and don’t ignore the comments, which also bring some useful insights).

The tool

Developed using GNU Flex in one afternoon (but with some thinking in advance :-).

It is not bullet proof, and it will not magically do everything for you.

But it will:

  • replace a big chunk of the CRT calls with the generic versions (see the small before/after sketch after this list)
  • add _T or TEXT wrappers around strings and characters
  • avoid touching strings/characters that already use _T/TEXT or are already Unicode (L"Hello")
  • make no changes inside comments or string literals
  • give warnings for several things you will have to fix by hand
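
Just to give an idea, here is a hypothetical before/after fragment (the exact set of functions the tool maps may differ, so take it as a sketch; the generic versions come from tchar.h):

// Before (ANSI only):
char msg[80];
strcpy( msg, "Hello, world" );
printf( "%s\n", msg );

// After the tool runs (generic text mappings):
TCHAR msg[80];
_tcscpy( msg, _T("Hello, world") );
_tprintf( _T("%s\n"), msg );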

The warnings

Code page conversion

First, MultiByteToWideChar and WideCharToMultiByte: you will have to see if the conversion still needs to be done. If the original string was not Unicode and now it is, no conversion is required.

But you will probably have to do conversions in order to read legacy files.
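
For example, here is a minimal sketch of reading a legacy file and converting its text, assuming the file is known to be in code page 1252 (hFile is assumed to be an already open handle, the code page and buffer sizes are placeholders, and error handling is omitted):

char legacy[256];      // raw bytes, exactly as stored in the old, non-Unicode file
DWORD bytesRead = 0;
ReadFile( hFile, legacy, sizeof(legacy), &bytesRead, NULL );

WCHAR wide[256];
int cch = MultiByteToWideChar( 1252, 0, legacy, (int)bytesRead,
                               wide, sizeof(wide)/sizeof(wide[0]) );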

And you might discover that some of the buffers you used to move binary data around are now converted to TCHAR. But it is good practice to use BYTE for that, not char (or unsigned char). So if you used char, that is really your fault (or the fault of the guy who wrote the code, if you are just a maintainer :-)
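
Just as an illustration of that convention (nothing the tool enforces):

BYTE  rawData[512];     // binary data: BYTE makes it obvious this is not text, so it stays bytes
TCHAR message[128];     // real text: this is what should become wide in a Unicode build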

GetProcAddress

This has two traps:

  1. It takes an ASCII string as a parameter, so it should remain char * (or const char *, or LPSTR, or whatever) and the string itself should not be wrapped with _T/TEXT. So if the tool already “fixed” it, you will have to undo that.
  2. The name of the function might be that of an ANSI version (stuff like MessageBoxA, or LoadLibraryA, or one of the hundreds of such APIs). In case you still want to be able to compile your application as ANSI, you will need some conditional code (a small usage sketch follows it):
    #ifdef UNICODE
    #define IMPORTAPI_MSGBOX "MessageBoxW"
    #else
    #define IMPORTAPI_MSGBOX "MessageBoxA"
    #endif // UNICODE
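
    With a define like that, the dynamic call could look roughly like this (just a sketch, assuming <windows.h> and <tchar.h> are included; the typedef name is made up and error handling is minimal):

    typedef int (WINAPI *PFN_MSGBOX)(HWND, LPCTSTR, LPCTSTR, UINT);

    HMODULE hUser32 = LoadLibrary( _T("user32.dll") );
    if( hUser32 )
    {
        // GetProcAddress always takes a narrow (char*) name, even in a Unicode build
        PFN_MSGBOX pfnMsgBox = (PFN_MSGBOX)GetProcAddress( hUser32, IMPORTAPI_MSGBOX );
        if( pfnMsgBox )
            pfnMsgBox( NULL, _T("Hello"), _T("ToUnicode"), MB_OK );
        FreeLibrary( hUser32 );
    }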
    

sizeof

Most APIs will take the length of a buffer in TCHARs, but sizeof will give you the size in bytes. So if you used to write into a buffer of chars, using sizeof(buffer) was nice and dandy.

Example:

char buffer[100];
_snprintf( buffer, sizeof(buffer), "Some error: '%s'\n", szError );

Now it becomes:

TCHAR buffer[100];
_sntprintf( buffer, sizeof(buffer), _T("Some error: '%s'\n"), szError );

In the first case sizeof(buffer) is 100, but in the second case it is 100*sizeof(TCHAR), meaning 200 in a Unicode build. This can lead to a buffer overrun, because the count should be expressed in characters, not in bytes.

Anyway, if you read Michael’s posts you already know what the problems are. If you did not, go there and read them, really.

You will need to fix it by using sizeof(buffer)/sizeof(buffer[0]) or sizeof(buffer)/sizeof(TCHAR).

Personally I prefer _countof (not available in older VS versions), which can detect if it is mistakenly applied on pointers instead of arrays.

What I normally do if I think there is any chance I will use the code on VS 6:

#ifndef _countof
#define _countof(arr) (sizeof(arr)/sizeof(arr[0]))
#endif
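
With that in place, the earlier example can pass the count in TCHARs no matter how the project is built (again, just a sketch):

TCHAR buffer[100];
_sntprintf( buffer, _countof(buffer), _T("Some error: '%s'\n"), szError );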

Download

Ok, you can download the tool (with the sources) from here.

If you want to compile it yourself, you will need GNU flex 2.5.2 in the tools folder.

I am using the Windows version from Wilbur Streett’s page, but you can probably use the cygwin version, or whatever you want.

And if you find bugs and you fix them, it would be nice to share that :-)

And just in case it is not obvious, this tool is provided “as is,” no guarantees, no responsibility :-)

9 Comments to “ToUnicode – Automating some of the steps of Unicode code conversion (Windows)”

  1. Johnk745 says:

    Really informative article post.Thanks Again. Awesome.

  2. Roger Bamforth says:

    There’s another thing that I’d not seen mentioned anywhere until I thought about it, started googling and found this thread.

    http://www.tech-archive.net/Archive/VC/microsoft.public.vc.mfc/2007-09/msg00867.html

    Visual Studio puts this code at the top of each .cpp file

    #ifdef _DEBUG
    #define new DEBUG_NEW
    #undef THIS_FILE
    static char THIS_FILE[] = __FILE__;
    #endif

    THIS_FILE[] should NOT be changed to a TCHAR.

    • Roger Bamforth says:

      So how do I format stuff so it’s readable? Should I be using HTML tags?

      • Mihai says:

        Yes. But the default should also be smarter about line breaks. I have to figure out what I have to change in the configuration.

    • Mihai says:

      I know about that (and others :-)), but the “parser” is not smart enough for that, and I don’t feel like writing a full C/C++ parser :-)

  3. Roger Bamforth says:

    I’m just starting on converting our project to Unicode and am finding ToUnicode very helpful, thanks.

    I’ve just spent the afternoon re-learning all the stuff about batch files that I forgot years ago and thought the results of this might be useful. Here is a batch file that will run ToUnicode.exe on all the .h and .cpp files in a folder and its sub-folders. Any warning messages that are produced are saved in unicode.log.

    By the way, I spent a little while wondering why ToUnicode.exe wasn’t working until I looked at the source code and realised the -T option is case sensitive and -t wouldn’t work. It may be worth mentioning this somewhere.

    Regards

    – Roger


    ::----------------------------------------------------

    :: Converts all the .h and .cpp files in the current folder and all subfolders to a unicode version.
    :: The unicode version of foo.h is called foo.u.h and similarly the unicode version of foo.cpp is called foo.u.cpp.
    :: Any warnings that the conversion program produces can be found in unicode.log

    :: See http://www.robvanderwoude.com/ntfor.php for info on using the "for" batch command.
    :: See http://shaunedonohue.blogspot.com/2007/09/every-time-i-need-to-redirect-dos.html for info on redirection, including how to redirect stderr to a file.

    @echo off

    set logfile="unicode.log"

    echo Starting Unicode conversion...

    echo ToUnicode warning messages >>%logfile%
    echo -------------------------- >>%logfile%
    echo. >>%logfile%

    :: The 2>> is redirecting stderr to %logfile% instead of stdout.
    (for /R %%i in (*.h *.cpp) do ToUnicode -T "%%i" "%%~dpni.u%%~xi") 2>>%logfile%

    echo. >>%logfile%
    echo ----------------------- >>%logfile%
    echo End of warning messages >>%logfile%

    echo ...finished Unicode conversion

    • Mihai says:

      Thank you.
      I have tried to format your comment a bit.

      Although I have considered recursively scanning for files, I thought it was a bit too dangerous :-)
      But your batch might be useful for some (especially since the original file is not lost).

      And if I put an update out at some point I might make the -t/-text options case insensitive.

      • Roger Bamforth says:

        Yes, I’m running the batch file on each project and then diffing the results to check what it did. Without doing that it probably could be a bit dangerous.

        – Roger
