Internationalization Cookbook
This is my personal blog. The views expressed on these pages are mine alone and not those of my employer.

String API and internationalization

Q. What string class/API is best for internationalization? Should I use CString, iostream, std::string?

A. As always, the answer is “It depends :-)” Here are some considerations.

Introduction

C++ die-hards might complain and claim there is nothing to consider here: iostream is definitely the way to go because it is type-safe (unlike the other options) and part of the standard, so it is cross-platform, and all is nice and dandy. But it is a mess to use for internationalization.

But before attacking me on this, read the two topics on string formatters in Herb Sutter’s “Exceptional C++ Style” (http://www.amazon.com/gp/product/0201615622/), and you will see the problem is not that simple, even without internationalization :-).
Sutter compares sprintf, snprintf, stringstream, strstream and boost::lexical_cast (he does not care about platform APIs :-)). No clue why he ignores boost::format (maybe it was not out at the time?).

His analysis takes 16 pages, in two parts, so you can imagine it is quite serious. He looks at ease of use, code clarity, whether it is standard (C90, C99, C++03, C++0x), efficiency, length safety, type safety, and usability with templates. Guess what: iostream is not the clear winner.
And he does not even look at internationalization!

His conclusion:

  • to convert a value to a string => boost::lexical_cast
  • simple formatting => stringstream, strstream (but notes that “the code will be more verbose and harder to grasp”)
  • for more complex formatting => snprintf
  • never sprintf

Now I will try to look at several string APIs from the internationalization perspective.

The typical operations that are important from this perspective are:

  • loading a string from resources (or message catalog, or whatever). This is not a major problem, and is platform-dependent anyway, but writing a small function for this should be trivial. So if the class is cross-platform and has nothing already in place for this task, I will assume you use something called GetTranslation, and I don’t care how you implement it.
  • replace some parameters with current values
  • display the result (again, platform dependent, I will not deal with it)
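To make the examples below easier to follow, here is a minimal sketch of what such a GetTranslation could look like. Everything in it is an assumption made for illustration: the in-memory map stands in for real resources or a message catalog, and the IDs and the Catalog helper are hypothetical names.

```cpp
#include <map>
#include <string>

// Hypothetical string IDs; a real project would generate these from its resources.
enum StringId { IDS_CLICKTOACT, IDS_ACTION_DELETE };

// Hypothetical in-memory catalog standing in for resources or a message catalog.
inline const std::map<int, std::string>& Catalog() {
    static const std::map<int, std::string> catalog = {
        { IDS_CLICKTOACT,    "You should click Ok to %s %d files" },
        { IDS_ACTION_DELETE, "delete" },
    };
    return catalog;
}

// Copies the translation for `id` into `out`.
// Returns false and clears `out` if the ID is unknown.
inline bool GetTranslation(std::string& out, int id) {
    const auto it = Catalog().find(id);
    if (it == Catalog().end()) {
        out.clear();
        return false;
    }
    out = it->second;
    return true;
}
```

The output string comes first and the ID second, to match the calls used in the rest of this article.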

As an example I will use a simple string with two parameters, something like this:

"You should click Ok to %s %d files"

I will also see how the API behaves when dealing with some typical localization issues:

Spaces: there are languages that do not use spaces to separate words, or use them only in some situations (Thai, Japanese). A space looks like just another piece of text, so one can move it to resources and call it a day, right? Well, no, because depending on the context the English space is “translated” as a space or as nothing, and since you have no way to know the rules for each language, you will have to store all spaces in resources, with IDs clear enough for the translator to figure out where each one belongs (to what sentence and where). So I will look at this separately.

Yoda speak: there are languages that require the order of the various speech parts to be changed (see “Linguistic typology” at Wikipedia, http://en.wikipedia.org/wiki/Linguistic_typology). But because I don’t want to single out any language out there, I will call it Yoda-speak, and make it sound like this:

"FILES %d TO %s CLICK OK YOU SHOULD"

It is important to note the %d and %s switching position (which can crash your application).
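One way to catch this before it reaches the user is to compare the conversion specifiers of the original and the translated string. The sketch below is a simplified illustration, not a full printf parser: the function names are hypothetical, and it only looks at the conversion character of each specifier.

```cpp
#include <string>
#include <vector>

// Collects the conversion characters ('s', 'd', ...) of a printf-style format, in order.
// Simplified: skips flags, width, precision and length modifiers; ignores "%%".
inline std::vector<char> ConversionSpecifiers(const std::string& format) {
    const std::string skippable = "-+ #0123456789.*hlLjzt";
    std::vector<char> result;
    for (std::string::size_type i = 0; i < format.size(); ++i) {
        if (format[i] != '%') continue;
        ++i;                                                   // move past '%'
        if (i < format.size() && format[i] == '%') continue;   // literal percent sign
        while (i < format.size() && skippable.find(format[i]) != std::string::npos) ++i;
        if (i < format.size()) result.push_back(format[i]);    // the conversion character
    }
    return result;
}

// True when both strings expect the same argument types in the same order.
inline bool SameArgumentOrder(const std::string& original, const std::string& translated) {
    return ConversionSpecifiers(original) == ConversionSpecifiers(translated);
}
```

With the two example strings above, SameArgumentOrder returns false, because the English string expects (%s, %d) and the Yoda-speak one expects (%d, %s).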

Context (complete string): the strings stored in resources should be as close to complete sentences as possible. It is almost impossible to translate fragments of sentences with no context at all. And no, it is not a solution to provide the running software to the translator. Searching in all the UI for some bits and pieces of strings is slow and error prone.

Format control: sometimes the formatting of some elements (dates, numbers) is locale sensitive, so it would be nice to have control over things like decimal/hex display or the number of decimal digits. For instance, currency values need 2 decimals in most locales, but some currencies, such as the Kuwaiti or Bahraini dinar, use 3.

Readability: how clear and easy it is to read and understand the code.

Cross-platform: usually part of the standard library also means cross-platform. There is not much to say here, just yes-no.

Unicode: is there a Unicode equivalent in the family of functions?

The APIs

iostream (strstream)

Original code:

std::string action = "delete";
int ncount = 12;
cout << "You should click Ok to " << action << " " << ncount << " files";

Resources:

IDS_CLICKOK = "You should click Ok to "
IDS_ACTION_DELETE = "delete"
IDS_ACTION_FILES = "files"

International code:

std::string s1, s2, s3, s4;
GetTranslation( s1, IDS_CLICKOK );
GetTranslation( s2, IDS_ACTION_DELETE );
GetTranslation( s3, IDS_ACTION_FILES );
cout << s1 << s2 << " " << ncount << " " << s3;

Spaces: there is no way to remove the space, so this is wrong.

Yoda speak: the order changes completely, so this is wrong.

Context: the translator has to translate bits and pieces of a sentence, quite a mess.

Format control: printing the number in hex, or controlling precision, makes the code even more difficult to read, and cannot be changed by localization.

Readability: this is ugly and difficult to read.
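To illustrate the format control point, here is a small sketch with two hypothetical helpers. In the stream version the precision is a manipulator in the code, so it is fixed at compile time. In the printf-style version the precision is part of the format string, which could come from resources.

```cpp
#include <cstdio>
#include <iomanip>
#include <sstream>
#include <string>

// Stream version: the precision lives in code, so a translator
// cannot change it by editing a resource string.
inline std::string FormatPrice(double value) {
    std::ostringstream out;
    out << "Price: " << std::fixed << std::setprecision(2) << value;
    return out.str();
}

// printf-style version: the precision is part of `format`, which may be localized.
// Assumes the result fits in 64 chars; longer output is truncated by snprintf.
inline std::string FormatPriceWith(const std::string& format, double value) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, format.c_str(), value);
    return buffer;
}
```

Calling FormatPriceWith with "Price: %.2f" or "Price: %.3f" changes the number of decimals without touching the code, while FormatPrice always produces two.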

printf (snprintf)

Original code:

printf( "You should click Ok to %s %d files", action.c_str(), ncount );

Resources:

IDS_CLICKTOACT = "You should click Ok to %s %d files"
IDS_ACTION_DELETE = "delete"

International code:

std::string msg, action;
GetTranslation( msg, IDS_CLICKTOACT );
GetTranslation( action, IDS_ACTION_DELETE );
printf( msg.c_str(), action.c_str(), ncount );

Spaces: good. The spaces are part of the string, so the translator can remove them.

Yoda speak: the order changes completely, so this is wrong.

Context: the translator gets an almost complete sentence to translate, good.

Format control: it is easy to control the format because the control flags are part of the string.

Readability: very good. In fact this is one of the main strengths of the printf family :-).

Using snprintf is more difficult. One has to allocate the target buffer, which raises the problem of ownership (who is going to free it). Sure, you can allocate, snprintf, assign the result to a std::string and then free the buffer, but that is so much work! And in fact estimating the size of the target buffer might not be such a simple task.
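For what it is worth, the usual way around the size estimation is to call snprintf twice: once with a null buffer just to measure, then again to write. The sketch below assumes a C99/C++11 conforming snprintf (older Windows variants such as _snprintf behave differently), and FormatClickMessage is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Formats into a std::string: the first snprintf call, with a null buffer,
// only measures the output; the second call writes it.
inline std::string FormatClickMessage(const char* format, const char* action, int count) {
    const int needed = std::snprintf(nullptr, 0, format, action, count);
    if (needed < 0) return std::string();                 // formatting error
    // +1 for the terminating zero that snprintf always writes.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::snprintf(buffer.data(), buffer.size(), format, action, count);
    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```

The vector owns the temporary buffer, so the ownership question goes away, but it is still more ceremony than a class that manages its own memory.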

CString::Format

Original code:

CString msg;
msg.Format( "You should click Ok to %s %d files", action, ncount );

Resources:

IDS_CLICKTOACT = "You should click Ok to %s %d files"
IDS_ACTION_DELETE = "delete"

International code:

CString msg;
CString action( IDS_ACTION_DELETE );
msg.Format( IDS_CLICKTOACT, action, ncount );

Spaces: good. The spaces are part of the string, so the translator can remove them.

Yoda speak: the order changes completely, so this is wrong.

Context: the translator gets an almost complete sentence to translate, good.

Format control: it is easy to control the format because the control flags are part of the string.

Readability: very good. In fact this is one of the main strengths of the printf family :-).

This looks pretty much like the old printf, but compared with snprintf the memory management is easier (in fact, there is nothing you have to do).
Loading from resources is also easy, because the class is platform specific (Windows) and can afford to know about such things.

Overall, a good replacement for snprintf.

CString::FormatMessage

Original code:

CString msg;
msg.FormatMessage( "You should click Ok to %1!s! %2!d! files", action, ncount );

Resources:

IDS_CLICKTOACT = "You should click Ok to %1!s! %2!d! files"
IDS_ACTION_DELETE = "delete"

International code:

CString msg;
CString action( IDS_ACTION_DELETE );
msg.FormatMessage( IDS_CLICKTOACT, action, ncount );

Spaces: good. The spaces are part of the string, so the translator can remove them.

Yoda speak: this is the first one that can solve the Yoda-speak problem.
The translation looks like this:

IDS_CLICKTOACT = "FILES %2!d! TO %1!s! CLICK OK YOU SHOULD"
IDS_ACTION_DELETE = "DELETE"

and the above code works without any kind of changes. Nice!
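The reason this works is that each placeholder carries the index of its argument, so the position in the string no longer matters. Here is a tiny portable sketch of that idea. It is not how FormatMessage is implemented and it does not use its syntax: the placeholders are a simplified %1 to %9 with no type information, and FormatPositional is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Replaces %1..%9 with the matching entry of `args` (1-based); "%%" yields '%'.
// Unknown or out-of-range placeholders are copied unchanged.
inline std::string FormatPositional(const std::string& pattern,
                                    const std::vector<std::string>& args) {
    std::string result;
    for (std::string::size_type i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '%' && i + 1 < pattern.size()) {
            const char next = pattern[i + 1];
            if (next == '%') {                 // escaped percent sign
                result += '%';
                ++i;
                continue;
            }
            if (next >= '1' && next <= '9') {  // numbered placeholder
                const std::size_t index = static_cast<std::size_t>(next - '1');
                if (index < args.size()) {
                    result += args[index];
                    ++i;
                    continue;
                }
            }
        }
        result += c;
    }
    return result;
}
```

With the arguments { "delete", "12" }, the pattern "You should click Ok to %1 %2 files" and its Yoda-speak counterpart "FILES %2 TO %1 CLICK OK YOU SHOULD" both format correctly with the same call.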

Context: the translator gets an almost complete sentence to translate, good.

Format control: it is easy to control the format because the control flags are part of the string.

Readability: very good. Might take a bit to get used to all the exclamation marks, but still good.

This has all the advantages of CString::Format plus some. So the best thing to do is to use CString::FormatMessage throughout.

It is also notable how easy it is to use string IDs, without extra steps to load from resources (the advantage of being Windows specific).

boost::format

Original code:

cout << boost::format( "You should click Ok to %1$s %2$d files" ) % action % ncount;

Resources:

IDS_CLICKTOACT = "You should click Ok to %1$s %2$d files"
IDS_ACTION_DELETE = "delete"

International code:

std::string msg, action;
GetTranslation( msg, IDS_CLICKTOACT );
GetTranslation( action, IDS_ACTION_DELETE );
cout << boost::format( msg ) % action % ncount;

Spaces: good. The spaces are part of the string, so the translator can remove them.

Yoda speak: it can properly handle changes in the order of the parameters, so it is OK.

Context: the translator gets an almost complete sentence to translate, good.

Format control: it is easy to control the format because the control flags are part of the string.

Readability: very good. It might take a bit to get used to all the percent signs, but still good readability.
As good as CString::FormatMessage, especially if you have to be cross-platform (and if you don’t care about the funny % between parameters :-)).

It is also type-safe and all, so if you want “C++ purity”, then you can go with the boost format library.

The only minor drawback is that there is no function to load resource strings (this is a C++ limitation), but one is easy to implement.

International Components for Unicode (ICU)

Original code:

UErrorCode errCode = U_ZERO_ERROR;
UnicodeString outString;

Formattable arguments[] = { "delete", 12 };
MessageFormat::format( "You should click Ok to {0} {1} files",
	arguments, 2, outString, errCode );
// now you can use outString for output

Resources:

IDS_CLICKTOACT = "You should click Ok to {0} {1} files"
IDS_ACTION_DELETE = "delete"

International code:

UErrorCode errCode = U_ZERO_ERROR;
UnicodeString outString;

UnicodeString action = resBundle.getStringEx( IDS_ACTION_DELETE, errCode );
UnicodeString msg = resBundle.getStringEx( IDS_CLICKTOACT, errCode );

Formattable arguments[2];
arguments[0] = action;
arguments[1] = 12;

MessageFormat::format( msg, arguments, 2, outString, errCode );
// now you can use outString for output

Spaces: good. The spaces are part of the string, so the translator can remove them.

Yoda speak: it can properly handle changes in the order of the parameters, so it is OK.

Context: the translator gets an almost complete sentence to translate, good.

Format control: no way to control the format, because the control flags are not part of the string.

Readability: very good. The format will look very familiar to Java programmers :-).

Note also the cross-platform resource bundle API that is part of ICU.

ICU is the main cross-platform internationalization library, offering way more than string formatting. And if C++ is not what you want, ICU also exposes a plain C API.

Other notes

std::string internal buffer access

Loading strings from resources is not difficult, but for std::string it is not really elegant.

The main problem is the lack of write access to the internal buffer.

One should retrieve the size of the resource string, allocate a char (or wchar_t) array for it, read in that buffer, assign the value to the target std::string and free the buffer.

It would be nice to have a pair of members allowing a contractual access to a read-write buffer:

std::string str;
str.get_write_buffer( size_type size );
// write in the buffer, taking care not to exceed size and to end with zero
str.release_write_buffer();

I know, the internal implementation of std::string can decide on a non-contiguous buffer and so on. But there are already members that guarantee a contiguous memory block (std::string::data and std::string::c_str), which might already mean the class has to allocate that block, at least temporarily.
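Since C++11 the standard does guarantee contiguous storage for std::string, so resizing the string first and then writing through &str[0] gives roughly the contract described above. A minimal sketch, assuming a C++11 compiler; LoadRawString is a hypothetical stand-in for a platform loader, not a real API.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical C-style loader standing in for a platform API: returns the length
// of the string and, when `buffer` is not null, copies at most `size` chars into it.
inline std::size_t LoadRawString(int /*id*/, char* buffer, std::size_t size) {
    static const char text[] = "You should click Ok to %s %d files";
    const std::size_t length = sizeof text - 1;
    if (buffer != nullptr) std::memcpy(buffer, text, size < length ? size : length);
    return length;
}

// Reads straight into the std::string storage, with no temporary array.
inline std::string LoadIntoString(int id) {
    // First call only asks for the length, so the string can be sized up front.
    std::string result(LoadRawString(id, nullptr, 0), '\0');
    if (!result.empty()) LoadRawString(id, &result[0], result.size());
    return result;
}
```

It works, but the intent is less explicit than a dedicated get/release pair: the resize call doubles as the "reserve a write buffer" step.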

wchar_t is not really Unicode

This is something that affects all the standard APIs (iostream, printf, boost::format).

According to the standard, the size of wchar_t is not specified at all.

In fact it can even be a byte (not very useful for Unicode). And as a result Windows decided for a 16 bit wchar_t, while most UNIX-es went for a 32 bit wchar_t.

The second problem is that it is nowhere spelled out that the data is Unicode (and there are libraries out there that use wchar_t to store non-Unicode, multi-byte characters).

Add to the mix the fact that many libraries use UTF-16 (16 bit) encoding (e.g. ICU, Qt, Xerces, Mac OS X API, Windows API, Java) and you have a big problem mixing them with the standard C API using a 32 bit wchar_t.
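A quick way to see the mismatch is to look at how many code units the same character takes in each representation. The sketch below assumes a C++11 compiler (for char16_t and char32_t); the function names are just for illustration.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane,
// so UTF-16 needs a surrogate pair for it: 2 code units.
inline std::size_t Utf16Units() { return std::u16string(u"\U0001F600").size(); }

// UTF-32 stores every code point in a single code unit: 1 code unit.
inline std::size_t Utf32Units() { return std::u32string(U"\U0001F600").size(); }

// Depends on the platform: typically 2 bytes on Windows,
// 4 bytes on most UNIX-like systems.
inline std::size_t WideCharBytes() { return sizeof(wchar_t); }
```

Code that hands a wchar_t buffer to a UTF-16 library is only correct on platforms where WideCharBytes() happens to be 2; everywhere else a conversion is needed.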

Summary

                 iostream      printf       CString::   CString::        boost::   ICU
                 (strstream)   (snprintf)   Format      FormatMessage    format
Spaces           Bad           Good         Good        Good             Good      Good
Yoda speak       Bad           Bad          Bad         Good             Good      Good
Context          Bad           Good         Good        Good             Good      Good
Format control   Bad           Good         Good        Good             Good      Bad
Readability      Bad           Good         Good        Good             Good      Good
Cross-platform   Good          Good         Bad         Bad              Good      Good
Unicode          Good          Good         Good        Good             Good      Good

Unicode notes:

  • iostream (strstream): templates, instantiate with wchar_t
  • printf (snprintf): wprintf ok, but _snwprintf is not standard
  • CString::Format: CStringW or CStringT instantiated with wchar_t
  • CString::FormatMessage: CStringW or CStringT instantiated with wchar_t
  • boost::format: templates, instantiate with wchar_t

7 Comments to “String API and internationalization”

  1. Joe says:

    You should’ve included type safety as a criterion for comparison, too. Type safety is extremely important for internationalization, because if your application crashes at random times because the translator messed up the format specifiers in some rare message, that is really not nice.

    Type Safety
    iostream: good
    printf: bad
    CString::Format: bad
    CString::FormatMessage: bad
    boost::format: good
    ICU: good

    • Mihai says:

      That part should be taken care of in the localization process.
      Any decent localization tool will check that the placeholders in English also exist in the localized string.
      So I am not going to “sneak” this argument back in the discussion :-)
      (and it is really nicely covered by Sutter)

  2. Joe says:

    From Herb Sutter : “However, current ISO C++ does require &str[0] to cough up a pointer to contiguous string data (but not necessarily null-terminated!), so there wasn’t much leeway for implementers to have non-contiguous strings, anyway.” http://herbsutter.com/2008/04/07/cringe-not-vectors-are-guaranteed-to-be-contiguous/#comment-483

    Yes, std::string has a minimal interface and that’s certainly for the worse. I don’t know if you’ve been involved in it, but there was recently an interesting discussion (spread over too many threads) on the boost mailing list about what a modern string class should be (unicode, i18n…)

    http://lists.boost.org/Archives/boost/2011/01/176046.php
    http://lists.boost.org/Archives/boost/2011/01/175554.php

    • Mihai says:

      You are right that by promising that the data is contiguous (which I also hinted at by saying “members that guarantee a contiguous memory block”) all implementations are prevented from using non-contiguous storage. It just feels bad that this feature is “by mistake” :-)

      Thanks for the links to the boost thread. I was not involved, and I will try and catch up.
      What pushed me away from boost before was the insistence that the string should be encoding-agnostic.
      That is just antiquated, hampers a good implementation and might lead to poor design, with clunky signatures that are there “just in case someone invents something better than Unicode” :-)

      In fact, C++0x with u16string and u32string (a u8string might also be nice) makes an almost official recognition of Unicode, non-format-agnostic string types. A good move, I think. What is missing on top of that would be a set of Unicode-aware methods.

      But let’s not start a blog entry in the comments area :-)

  3. Joe says:

    For “std::string internal buffer access” use &str[0]. The next standard will require contiguous storage for std::string, and right now there’s no major implementation of std::string that uses a non-contiguous implementation.

    • Mihai says:

      True, but not standard yet :-)
      I find the CString explicit GetBuffer/ReleaseBuffer calls better than a hack that just happened to be “blessed” by the standard.
      Even if I access the std::string buffer using &str[0], I still have to adjust the new string length using other hacks, like an explicit call to resize.
      But well, yes, possible.
