Story
Windows 95, 98 and Me are (mostly) gone. Yes, I know they are still lurking in some places, but development for them has (mostly) stopped.
But the ghosts are still here to haunt users and developers: hundreds of thousands of non-Unicode applications.
And very often questions show up in newsgroups about supporting all kinds of languages on non-matching operating systems (Japanese on English machines, etc.). And when the only solution is “move to Unicode” the answer is usually “the application is way too big, we cannot afford it.”
It looks like the price of converting a lot of code to Unicode is too high. But the price of maintaining non-Unicode applications in a Unicode world, with all kinds of hacks to enable some kind of limping support, is higher.
It looks difficult only because many people don’t know what it implies. Thing is, the conversion process is relatively simple, with big chunks of it being easy to automate.
How to start
Although the tool automates some of the steps, you still have to understand what is going on, because there is enough work left for you too.
Start by reading Michael Kaplan’s “Converting a project to Unicode” (and don’t ignore the comments, which also bring some useful insights):
- Part 0 (The introduction)
- Part 1 (Business before pleasure)
- Part 2 (‘Sorry, you’re not my type.’ ‘Um, maybe I could change that?’)
- Part 3 (Can I quote you on that?)
- Part 4 (/Delightful, /Delicious, /DUnicode!)
- Part 5 (Are we there yet? Well, not just yet)
- Part 6 (Upon the road not traveled)
- Part 7 (What does it mean to fit things to a ‘T’, anyway?)
- Part 8 (Fitting MSLU into the mix)
- Part 9 (The project’s postpartum postmortem)
The tool
Developed using GNU Flex in one afternoon (but with some thinking in advance :-).
It is not bulletproof, and it will not magically do everything for you.
But it will:
- replace a big chunk of the CRT API with generic versions
- add _T or TEXT wrappers around strings and characters
- “knows” to avoid strings/characters already using _T/TEXT or Unicode (L"Hello")
- does no changes in comments or inside strings
- gives warnings for several things you will have to fix by hand
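To make that concrete, here is a rough before/after sketch of the kind of change involved. It is an illustration, not actual tool output: the two function names are made up, and the non-Windows branch only defines stand-ins for tchar.h so the snippet compiles anywhere.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

#ifdef _WIN32
#include <tchar.h>              // the real generic-text mappings
#else
// Stand-ins (assumption): mimic the few <tchar.h> names used below.
#ifdef _UNICODE
typedef wchar_t TCHAR;
#define _T(x) L##x
#define _tcslen wcslen
#else
typedef char TCHAR;
#define _T(x) x
#define _tcslen strlen
#endif
#endif

// Before: typical hand-written ANSI code (hypothetical example).
inline size_t greeting_length_ansi() {
    const char *msg = "Hello";
    return strlen(msg);
}

// After: char -> TCHAR, literal wrapped in _T, strlen -> _tcslen.
inline size_t greeting_length_generic() {
    const TCHAR *msg = _T("Hello");
    return _tcslen(msg);
}
```

Built with _UNICODE defined, the second function works on wchar_t and ends up calling wcslen; without it, it compiles to the same thing as the first one.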
The warnings
Code page conversion
First MultiByteToWideChar and WideCharToMultiByte: you will have to see if the conversion still needs to be done. If the original string was not Unicode and now it is Unicode, no conversion is required.
But you will probably have to do conversions in order to read legacy files.
And you might discover that some of the buffers you used to move binary data around are now converted to TCHAR. But it is good practice to use BYTE for that, not char (or unsigned char). So if you used char, that is really your fault (or the fault of the guy who wrote the code, if you are just a maintainer :-)
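A small sketch of why the binary-buffer case matters, assuming a Unicode build. All the names here (RawHeader, TextHeader, read_header, HEADER_BYTES) are invented for the illustration, and TCHAR is a local stand-in for what tchar.h would give you.

```cpp
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char BYTE;   // same underlying type the Windows headers use
typedef wchar_t TCHAR;        // stand-in for a Unicode build (assumption)

enum { HEADER_BYTES = 16 };

// Binary data declared as BYTE: a char -> TCHAR conversion leaves it alone.
struct RawHeader  { BYTE  data[HEADER_BYTES]; };

// What a 'char data[16]' turns into after the conversion, in a Unicode build.
struct TextHeader { TCHAR data[HEADER_BYTES]; };

// Copies a raw header out of a byte stream; returns the bytes consumed.
inline size_t read_header(const BYTE *stream, RawHeader *out) {
    memcpy(out->data, stream, sizeof(out->data));
    return sizeof(out->data);
}
```

RawHeader stays at 16 bytes no matter how the project is built, while TextHeader grows with sizeof(TCHAR), which is exactly what you do not want for a file or network format.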
GetProcAddress
This has two traps:
- It takes an ASCII string as parameter, so it should remain char * (or const char *, or LPSTR, or whatever) and the string itself should not be wrapped with _T/TEXT. So if the tool already “fixed” it, you will have to undo it.
- The name of the function might be that of an ANSI version (stuff like MessageBoxA, or LoadLibraryA, or one of the hundreds of such APIs). If you still want to be able to compile your application as ANSI, you will need some conditional code:
#ifdef UNICODE
#define IMPORTAPI_MSGBOX "MessageBoxW"
#else
#define IMPORTAPI_MSGBOX "MessageBoxA"
#endif // UNICODE
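Putting the two traps together, a lookup might look roughly like the sketch below. The MsgBoxFn typedef and the load_message_box helper are hypothetical names, and the Windows-only part is guarded so the macro selection can be checked on any compiler.

```cpp
#include <assert.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

// The export name stays a narrow string in both builds: no _T/TEXT here.
#ifdef UNICODE
#define IMPORTAPI_MSGBOX "MessageBoxW"
#else
#define IMPORTAPI_MSGBOX "MessageBoxA"
#endif // UNICODE

#ifdef _WIN32
// Pointer type matching MessageBox for the current build (ANSI or Unicode).
typedef int (WINAPI *MsgBoxFn)(HWND, LPCTSTR, LPCTSTR, UINT);

// Hypothetical helper: returns NULL if the DLL or the export is missing.
inline MsgBoxFn load_message_box() {
    HMODULE user32 = LoadLibrary(TEXT("user32.dll"));   // TCHAR-based API
    if (user32 == NULL)
        return NULL;
    // GetProcAddress has no W variant: the name is always const char *.
    return (MsgBoxFn)GetProcAddress(user32, IMPORTAPI_MSGBOX);
}
#endif
```

Note the asymmetry: LoadLibrary gets a TEXT-wrapped string because it has A and W versions, while the name passed to GetProcAddress must stay narrow.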
sizeof
Most APIs will take the length of a buffer in TCHARs, but sizeof will give you the size in bytes. So if you used to write into a buffer of chars, using sizeof(buffer) was nice and dandy.
Example:
char buffer[100];
_snprintf( buffer, sizeof(buffer), "Some error: '%s'\n", szError );
Now it becomes:
TCHAR buffer[100];
_sntprintf( buffer, sizeof(buffer), _T("Some error: '%s'\n"), szError );
In the first case sizeof(buffer) is 100, but in the second case (in a Unicode build) it is 100*sizeof(TCHAR), meaning 200. This can lead to a buffer overrun, because the count should be expressed in characters, not in bytes.
Anyway, if you read Michael’s posts you already know what the problems are. If you did not, go there and read them, really.
You will need to fix it by using sizeof(buffer)/sizeof(buffer[0]) or sizeof(buffer)/sizeof(TCHAR).
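The difference between the two counts is easy to check. This is a minimal sketch assuming a Unicode build, with TCHAR as a local stand-in and made-up constant names. Since wchar_t is 2 bytes on Windows but often 4 elsewhere, it only relies on the ratio, not on the number 200.

```cpp
#include <assert.h>
#include <stddef.h>

typedef wchar_t TCHAR;   // stand-in for <tchar.h> in a Unicode build (assumption)

static TCHAR buffer[100];

// Size in bytes: what sizeof gives you, and what most APIs do NOT want.
static const size_t kBufferBytes = sizeof(buffer);

// Size in characters: what the TCHAR-based APIs expect.
static const size_t kBufferChars = sizeof(buffer) / sizeof(buffer[0]);
```

Passing kBufferBytes where kBufferChars is expected tells the API the buffer is sizeof(TCHAR) times larger than it really is.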
Personally I prefer _countof (not available in older VS versions), which can detect if it is mistakenly applied to pointers instead of arrays.
What I normally do if I think there is any chance I will use the code on VS 6:
#ifndef _countof
#define _countof(arr) (sizeof(arr)/sizeof(arr[0]))
#endif
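Keep in mind that the plain macro fallback does not have the pointer check: given a pointer it quietly computes sizeof(pointer)/sizeof(element). If the code is C++ only, a template along these lines gets the check back. array_count is a hypothetical name, not something from the SDK or the CRT, and whether an old compiler such as VS 6 accepts it is something you would have to try.

```cpp
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: only real arrays match the reference-to-array
// parameter, so passing a pointer is a compile error instead of a wrong count.
template <typename T, size_t N>
inline size_t array_count(T (&)[N]) {
    return N;
}
```

With TCHAR buffer[100], array_count(buffer) is 100 in both ANSI and Unicode builds, while array_count(ptr) on a TCHAR * simply does not compile.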
Download
Ok, you can download the tool (with the sources) from here.
If you want to compile it yourself, you will need GNU flex 2.5.2 in the tools folder.
I am using the Windows version from Wilbur Streett’s page, but you can probably use the cygwin version, or whatever you want.
And if you find bugs and you fix them, it would be nice to share that :-)
And just in case it is not obvious, this tool is provided “as is,” no guarantees, no responsibility :-)
Really informative article post. Thanks again. Awesome.
Thank you!
:-)
There’s another thing that I’d not seen mentioned anywhere until I thought about it, started googling and found this thread.
http://www.tech-archive.net/Archive/VC/microsoft.public.vc.mfc/2007-09/msg00867.html
Visual Studio puts this code at the top of each .cpp file
#ifdef _DEBUG
#define new DEBUG_NEW
#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#endif
THIS_FILE[] should NOT be changed to a TCHAR.
So how do I format stuff so it’s readable? Should I be using HTML tags?
Yes. But the default should also be smarter about line breaks. I have to figure out what I have to change in the configuration.
I know about that (and others :-)), but the “parser” is not smart enough for that, and I don’t feel like writing a full C/C++ parser :-)
I’m just starting on converting our project to Unicode and am finding ToUnicode very helpful, thanks.
I’ve just spent the afternoon re-learning all the stuff about batch files that I forgot years ago and thought the results of this might be useful. Here is a batch file that will run ToUnicode.exe on all the .h and .cpp files in a folder and its sub-folders. Any warning messages that are produced are saved in unicode.log.
By the way, I spent a little while wondering why ToUnicode.exe wasn’t working until I looked at the source code and realised the -T option is case sensitive and -t wouldn’t work. It may be worth mentioning this somewhere.
Regards
– Roger
Thank you.
I have tried to format your comment a bit.
Although I have considered recursively scanning for files, I thought it was a bit too dangerous :-)
But your batch might be useful for some (especially since the original file is not lost).
And if I put an update out at some point I might make -t/-text options case insensitive.
Yes, I’m running the batch file on each project and then diffing the results to check what it did. Without doing that it probably could be a bit dangerous.
– Roger