Internationalization Cookbook

Data internationalization

Q. How can I serialize or transfer data between localized versions of my application?

A. Use locale-independent data formats.

Introduction

One can find many articles and books dealing with internationalization, but most of them only address what the user can see: UI localization, date/time/number/currency formats, sorting, case conversion, and so on. So much so that many definitions of internationalization include some reference to being able to input, process, and output data in a locale-aware way.

In a few cases you can also see hints of something else, which I would call “Data internationalization.” In Nadine Kano’s classic book on Windows internationalization, one of the rules reads “All language editions can read one another’s documents.” And in the second edition of the book you can find a short paragraph (7 lines) in the XML section “Make Sure Your XML Data Is Locale- and Culture-Neutral.”

In this article I will try to do some “hair splitting” on this topic.

Why

Why do you have to care about this? Because you often have to exchange data between applications. And not only between your application and others, but also between the various versions of your application. This includes data stored in files (the native file format or some form of interchange file format, XML based or not), or network communications. It is something you have already considered (maybe) for cross-platform data exchange, but you have to extend it to cross-language data exchange.

In a world where refrigerators get connected to the Internet, and goods ordered by European clients from a U.S. server are shipped from China, this is more important than ever.

When

In general, the best practice is to do any conversions right before presenting the data to the user and before doing the data transfer.

It is a good idea to convert from the stored data format to the platform format when loading or receiving the data, and back again when saving or sending it, to ease data processing. You might use UTF-8 for transfer and another format for manipulation (something matching what the native Unicode API expects).

Even if the encoding changes, the data should still be locale independent during processing. Locale-aware formatting/parsing should happen right before the presentation layer.

How

Encoding

Use Unicode. It does not matter what format you go with, but it should be Unicode and it should be consistent.

What you should consider:

  • The Unicode Transformation Format (UTF-7, UTF-8, UTF-16, or UTF-32)
  • Little-endian, big-endian, or both. If you decide to support both, you will have to use the BOM (Byte Order Mark).
  • Normalization Form. If your application is Windows only, then Windows uses Normalization Form C, so there is not much need to think about it, but if cross-platform communication is required, then you have to make some decisions.

Sometimes you have to use some encoding other than Unicode in order to support legacy systems. In this case do the conversion to the legacy code page right before the data leaves your application, and convert to Unicode as soon as the data enters your application.
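
To make the "convert at the boundary" idea concrete, here is a minimal sketch in Python (my choice for illustration only). The function names are hypothetical, and the choices of UTF-8 for transfer, Normalization Form C, and code page 1252 as the legacy encoding are assumptions, not requirements.

```python
import unicodedata


def to_wire(text: str) -> bytes:
    """Normalize to NFC and encode as UTF-8 right before the data leaves."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


def from_wire(payload: bytes) -> str:
    """Decode UTF-8 (tolerating a leading BOM) and normalize on entry."""
    return unicodedata.normalize("NFC", payload.decode("utf-8-sig"))


def to_legacy(text: str, code_page: str = "cp1252") -> bytes:
    """Convert to a legacy code page at the boundary only.

    errors="strict" raises UnicodeEncodeError instead of silently
    replacing characters the code page cannot represent.
    """
    return text.encode(code_page, errors="strict")


decomposed = "e\u0301"   # 'e' followed by a combining acute accent
composed = "\u00e9"      # the same letter as a single code point

# Both spellings produce identical bytes once normalized.
print(to_wire(decomposed) == to_wire(composed))
```

Inside the application the data stays a Unicode string. Only these few functions know about bytes, BOMs, or code pages.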

Formatting

The main rule is to use locale-independent formats, and to use existing standards when available (ISO covers a lot). Storing a locale-specific format together with the locale might seem like a good idea, but it is not: it would mean that every data consumer has to understand all formats for all locales.

In general, especially if you need cross-platform compatibility, binary formats are loaded with traps. This is one of the reasons XML is more and more the preferred format. In this article I will use some fictional XML elements and attributes, but this does not mean that the guidelines are not applicable to other formats.

Numbers

If the data is Windows only, binary formats should not be a problem. But if text formats are used, then don’t use thousands separators, use the dot as the decimal separator, and put a minus sign in front of negative numbers, as in most programming languages.

<number value="-12345.123" />

Storing the numbers according to the current locale can lead to data that is totally incomprehensible. Just think about 123,456. No one can tell if the comma is a decimal separator or a thousand separator.
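
A small sketch of what this means in code, again in Python. The helper names are hypothetical, and using `Decimal` rather than binary floating point is my assumption. The point is that the storage format never depends on the current locale, and that an ambiguous string such as 123,456 is rejected instead of guessed at.

```python
import re
from decimal import Decimal

# Optional minus sign, ASCII digits, optional fraction after a dot.
_NUMBER = re.compile(r"-?[0-9]+(\.[0-9]+)?")


def serialize_number(value: Decimal) -> str:
    """Write plain digits: no grouping, '.' as decimal separator."""
    return format(value, "f")


def parse_number(text: str) -> Decimal:
    """Accept only the locale-independent form; anything else is an error."""
    if _NUMBER.fullmatch(text) is None:
        raise ValueError(f"not a locale-independent number: {text!r}")
    return Decimal(text)


print(serialize_number(Decimal("-12345.123")))   # -12345.123

try:
    parse_number("123,456")
except ValueError as error:
    print(error)
```

Locale-aware formatting for display is a separate step and belongs in the presentation layer.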

Currency

The amount should follow the number format described above, but the currency should also be stored, using the ISO 4217 currency codes.

<price amount="1234567.987" currency="USD" />

The currency can be stored separately (in which case there is no need for parsing), or it can be stored together with the value (in which case the order of the currency symbol might need to change for the user).

Using the ISO codes eliminates confusion: the ‘$’ symbol is used for many other currencies (Argentina, Australia, Brunei, Canada, Chile, Colombia, Mexico, New Zealand, Singapore), not only for the U.S. dollar.

Also, having a locale-independent identifier (the three-letter ISO code) allows you to show localized names to the user. Examples for USD: "$", "US$", "dollar des États-Unis", "Доллар США", etc.

Depending on the application, if the exchange rates are important, a timestamp (see date/time format), source and target currencies, and an exchange ratio should also be stored.

<price
    amount="1234567.987"
    currency="USD"
    currency_from="EUR"
    rate="1.24330" 
    timestamp="2005-09-11T18:31:17Z"
/>
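
As a rough sketch, this is how the fictional price element could be written and read back in Python. The element and attribute names come from the examples above. The helper names are hypothetical, and the currency check only verifies the shape of the code (three uppercase ASCII letters), not whether it is actually assigned in ISO 4217.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal
from typing import Tuple


def price_element(amount: Decimal, currency: str) -> ET.Element:
    """Build <price amount="..." currency="..."/> in locale-independent form."""
    looks_like_iso = (
        len(currency) == 3
        and currency.isascii()
        and currency.isalpha()
        and currency.isupper()
    )
    if not looks_like_iso:
        raise ValueError(f"expected a three-letter currency code, got {currency!r}")
    return ET.Element(
        "price", {"amount": format(amount, "f"), "currency": currency}
    )


def read_price(element: ET.Element) -> Tuple[Decimal, str]:
    """Read the amount and the currency code back, with no locale involved."""
    return Decimal(element.get("amount")), element.get("currency")


element = price_element(Decimal("1234567.987"), "USD")
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
print(read_price(ET.fromstring(xml_text)))
```

Because the currency is stored separately, the reader never has to parse a symbol out of the amount, and a symbol such as "$" is rejected on the writing side.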

Date and time, calendar, time zone

Use ISO 8601. It covers date and time representations, including time zones.

Time zones affect even single-locale applications. Imagine an auction that should end at 5:00 pm, taking place on a server in New York. Clients in California should have no problems scheduling and taking part in the bidding.

It is also good practice to use UTC, because ISO 8601 does not cover daylight saving time issues, while most operating systems and libraries have APIs for converting between UTC and local time.

<timestamp value="2005-09-11T18:31:17Z" />
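
Here is a minimal Python sketch of storing timestamps as ISO 8601 in UTC, using the auction example. The function names are hypothetical. To keep the sketch self-contained I use fixed offsets (UTC-4 for New York and UTC-7 for California on that date, both assumptions that hold only during daylight saving time), where a real application would use a proper time zone database.

```python
from datetime import datetime, timedelta, timezone


def to_iso_utc(moment: datetime) -> str:
    """Serialize an aware datetime as ISO 8601 in UTC, e.g. 2005-09-11T18:31:17Z."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: the time zone is unknown")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def from_iso_utc(text: str) -> datetime:
    """Parse the format written by to_iso_utc back into an aware datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


# Assumed fixed offsets for the sketch (daylight saving time in effect).
NEW_YORK = timezone(timedelta(hours=-4))
CALIFORNIA = timezone(timedelta(hours=-7))

auction_end = datetime(2005, 9, 11, 17, 0, tzinfo=NEW_YORK)
stored = to_iso_utc(auction_end)
print(stored)                                         # 2005-09-11T21:00:00Z

# The client converts to local time only for display.
print(from_iso_utc(stored).astimezone(CALIFORNIA))    # 14:00 on the same day
```

The stored value is the same for every client. Only the last line, which belongs to the presentation layer, depends on where the user is.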

The Gregorian calendar is the most popular one, and it is supported by all operating systems. All libraries and operating systems that support alternate calendars (lunar, imperial, Buddhist, etc.) also have APIs to convert to and from the Gregorian calendar.

Locale

Sometimes you have to specify the locale in which the data is stored (especially for data that cannot really be represented in a locale independent way, like for instance messages).

If the application is Windows only, then the locale ID (LCID) is a good choice.

For cross-platform communication you can use RFC 3066 or the Unicode Technical Standard 35 (TR-35). There is a good overlap (they are both based on ISO 3166 and ISO 639), but TR-35 is more complete. It is also newer, which means it is not yet supported by all libraries or operating systems. It is definitely a bad idea to create your own locale identifier system.

<message locale="en_US">Hello world</message>
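
Whichever standard you pick, it helps to canonicalize the identifiers you read so that "en_US", "en-us", and "EN-US" are treated as the same locale. The Python sketch below is deliberately limited: it only handles a language code with an optional two-letter region, ignores scripts, variants, and extensions, and uses hypothetical function names. It is not a full implementation of RFC 3066 or TR-35.

```python
import re
from typing import Optional, Tuple

# Language (2-3 letters), then optionally '-' or '_' and a 2-letter region.
_TAG = re.compile(r"([A-Za-z]{2,3})(?:[-_]([A-Za-z]{2}))?")


def parse_locale(tag: str) -> Tuple[str, Optional[str]]:
    """Split a simple locale identifier into (language, region or None)."""
    match = _TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"unsupported locale identifier: {tag!r}")
    language = match.group(1).lower()
    region = match.group(2)
    return language, (region.upper() if region else None)


def to_language_tag(tag: str) -> str:
    """Return a hyphen-separated form such as 'en-US'."""
    language, region = parse_locale(tag)
    return language if region is None else f"{language}-{region}"


print(to_language_tag("en_US"))   # en-US
print(to_language_tag("EN-us"))   # en-US
print(parse_locale("fr"))         # ('fr', None)
```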

Measurement units

As with currency, you can either convert everything to one measurement unit, or store the original values together with the corresponding measurement unit (which avoids rounding problems).

If you choose to convert everything to one measurement system, the international metric system is the obvious choice (see ISO 31).

<size value="1231234.123123" unit="m" />

This might lead to complex conversions between various XML schemas, but it is definitely better than misinterpreting the information stored.

A good example of complex conversion is between the various car fuel consumption standards. In the U.S., this is expressed by the number of miles one can drive with a gallon of gas. In Europe, it is the number of liters of gas needed to drive 100 kilometers.
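
The fuel consumption case is "complex" because one value is the inverse of the other, so the conversion is a division, not a simple scale factor. A short Python sketch, assuming U.S. gallons (the imperial gallon is larger and would need a different constant); the function names are hypothetical.

```python
LITERS_PER_US_GALLON = 3.785411784   # exact by definition
KM_PER_MILE = 1.609344               # exact by definition


def mpg_to_l_per_100km(mpg: float) -> float:
    """Convert miles per U.S. gallon to liters per 100 kilometers."""
    if mpg <= 0:
        raise ValueError("fuel economy must be positive")
    return 100.0 * LITERS_PER_US_GALLON / (KM_PER_MILE * mpg)


def l_per_100km_to_mpg(liters: float) -> float:
    """Convert liters per 100 kilometers to miles per U.S. gallon."""
    if liters <= 0:
        raise ValueError("fuel consumption must be positive")
    return 100.0 * LITERS_PER_US_GALLON / (KM_PER_MILE * liters)


print(round(mpg_to_l_per_100km(30.0), 2))   # 7.84
```

Both directions use the same formula, because converting twice returns the original value. Note that a higher number is better in one system and worse in the other, which is one more reason to store the unit explicitly.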

Separators

I am not sure if they deserve their own section, but it is important to keep the separators locale independent in the storage format.

Imagine a CSV (comma-separated values) file in which the comma is replaced by the list separator of the current locale (a semicolon in some locales), and the numbers are written with locale-dependent thousands and decimal separators. Then imagine moving that file to another locale. Or just try using CSV with Excel on various locales :-)
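
A minimal Python sketch of the locale-independent alternative: the delimiter is always a comma, the numbers always use the dot, and quoting takes care of commas inside text fields. The function names and the two-column layout are hypothetical.

```python
import csv
import io
from decimal import Decimal
from typing import Iterable, List, Tuple

Row = Tuple[str, Decimal]


def write_rows(rows: Iterable[Row]) -> str:
    """Write rows with a fixed ',' delimiter, regardless of the user's locale."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=",", lineterminator="\n")
    for name, amount in rows:
        writer.writerow([name, format(amount, "f")])
    return buffer.getvalue()


def read_rows(text: str) -> List[Row]:
    """Read the rows back using the same fixed delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=",")
    return [(name, Decimal(amount)) for name, amount in reader]


rows = [("Widget, large", Decimal("1234.5")), ("Gadget", Decimal("-0.75"))]
text = write_rows(rows)
print(text)
print(read_rows(text) == rows)
```

The file produced this way reads back identically on any machine, whatever list separator the user's regional settings prefer.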

Others

There are other elements that can be stored in a language independent way. You should do this whenever you can.

The three entries below represent the same color, but the last one is language dependent:

<color value="#FFFF00" method="RGB" />
<color value="60,100,100" method="HSB" />
<color value="yellow" />
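
The first two entries can be checked against each other mechanically, which is exactly what makes them language independent. A small Python sketch, assuming the HSB values are degrees for the hue and percentages for saturation and brightness (the function name is hypothetical):

```python
import colorsys


def hsb_to_hex(hue_deg: float, sat_pct: float, bri_pct: float) -> str:
    """Convert HSB (degrees, percent, percent) to an #RRGGBB string."""
    red, green, blue = colorsys.hsv_to_rgb(
        hue_deg / 360.0, sat_pct / 100.0, bri_pct / 100.0
    )
    return "#{:02X}{:02X}{:02X}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


print(hsb_to_hex(60, 100, 100))   # #FFFF00
```

No such conversion exists for the third entry without a table of color names, and that table would have to be translated for every language.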

Storing programming scripts in the documents is another example. You can have your scripting language based on English (like most programming languages are) or you can have a "localizable" programming language, which means that you should again store it in a language-independent way (some form of byte-code). Opening the document in another locale will automatically "translate" all the keywords.

But you should be careful if you choose the second method. Microsoft Office 95 did exactly this, but failed to provide a language independent way to send a new-line using SendKey. So an English script calling SendKey "{Enter}" failed on a German Office, which expected "{Eingabe}".

This was fixed long ago, and you should take care not to repeat the mistakes of the past :-)

