Ektron's Developer Group Blog

A blog for Ektron users, by Ektron Developers

Internationalization: strings with variables

(XLIFF, Localization)

The topic of internationalization (abbreviated I18N because there are 18 letters between 'I' and 'N') is rather extensive and complex. It is the process of preparing content, especially source code, to be translated for other locations or locales. The process of actually making content suitable for other locations is aptly named localization (L10N for short). The part of the process most people are already familiar with is translation, where the text is changed from one natural language, e.g., English, to another, e.g., Spanish. Localization is more than just translation because it includes cultural differences between countries (idioms, taboos, etc.), numeric, date, and monetary formats, and even the way colors and graphical symbols are used. In the United States, red means stop, yellow means caution, and green means go. These meanings come from the colors used in traffic signals. In other countries, these colors may or may not have the same connotation.

If you want to make just one small change in how you write source code, regardless of the programming language, there is one thing that will pay big dividends when it is time to internationalize and localize your code. In fact, this technique makes I18N part of the code construction and doesn't leave it for later when it is more expensive and risky to change existing, working code.

When coding, you frequently need to form a string with values from one or more variables. Take the following sentence, for example, where the file name comes from a variable.

The file, Example.txt, is missing.

The classic (and bad) way to create the sentence in code is:

"The file, " + strFileName + ", is missing."

This, however, might literally need to be translated as:

Is missing, the Example.txt file.

Because of the way the sentence is concatenated, the translation could not be made satisfactorily. The translator has two pieces of the sentence, "The file, " and ", is missing.".

Avoid Concatenating Sentences

Use a string Format or printf function when supported by the programming language.

"The file, %s, is missing." (C)
"The file, {0}, is missing." (C#, VB.NET)

This way, the translator has the entire sentence as one piece to translate.

For JavaScript, include the following code that implements a simple 'format' function.

// adapted from
String.format = function ()
{
    // e.g., String.format("hello {0}", "world")
    if (0 == arguments.length)
        return "";
    var str = arguments[0];

    // Replace {0} with the second argument, {1} with the third, and so on.
    for (var i = 1; i < arguments.length; i++)
    {
        var re = new RegExp("\\{" + (i - 1) + "\\}(?!\\})", "gm");
        str = str.replace(re, arguments[i]);
    }
    return str;
};
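
As a rough sketch of how a format function pays off, the whole sentence can be stored once per language and looked up at run time. The `messages` table, the `formatMessage` helper, and the Spanish wording below are hypothetical illustrations, not part of any Ektron API.

```javascript
// Minimal sketch: look up a whole-sentence format string by locale.
// `messages`, `formatMessage`, and the translations are illustrative assumptions.
var messages = {
    "en": { fileMissing: "The file, {0}, is missing." },
    "es": { fileMissing: "Falta el archivo {0}." }
};

// Replace {0}, {1}, ... with the corresponding extra arguments.
// Placeholders with no matching argument are left unchanged.
function formatMessage(template) {
    var args = Array.prototype.slice.call(arguments, 1);
    return template.replace(/\{(\d+)\}/g, function (match, index) {
        var i = Number(index);
        return i < args.length ? String(args[i]) : match;
    });
}

var strFileName = "Example.txt";
console.log(formatMessage(messages["en"].fileMissing, strFileName));
console.log(formatMessage(messages["es"].fileMissing, strFileName));
```

The calling code never changes when a language is added; only the table grows.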

Restructure the Sentence

Alternatively, if a format function is not available, restructure the sentence so it is in one string and the variable follows.

Missing file: Example.txt

"Missing file: " + strFileName

The key is to keep the sentence or phrase as a unit and not concatenate words to make a sentence.

The wording is perhaps not as elegant, but it is understandable. More is to be gained by having informative messages, like the one below, than by having polished wording.

The file is missing. Check that the file exists and that the name is correctly typed.
File: Example.txt

More than one variable

Multiple variables are supported better by .NET than by C because the placeholders are numbered, so the translator can change the order of the variables.

"'{0}' cannot have the value '{1}'."

might be literally translated as

"Value '{1}' is prohibited in '{0}'."

Even if the language does not support reordering, it is better than concatenation.
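
The reordering can be sketched in JavaScript as well. The `format` helper and the sample values are hypothetical; the two format strings are the ones shown above.

```javascript
// Sketch: numbered placeholders let a translation reorder the variables.
// `format` is a hypothetical helper; the sample values are made up.
function format(template) {
    var args = Array.prototype.slice.call(arguments, 1);
    return template.replace(/\{(\d+)\}/g, function (match, index) {
        var i = Number(index);
        return i < args.length ? String(args[i]) : match;
    });
}

var original = "'{0}' cannot have the value '{1}'.";
var reordered = "Value '{1}' is prohibited in '{0}'.";

// The calling code passes the arguments in the same order either way.
console.log(format(original, "Width", "-5"));
console.log(format(reordered, "Width", "-5"));
```

Only the format string differs between the two calls, which is exactly the part the translator controls.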

Numbers and Plurality

Numbers add another level of complexity because of plurality. Consider these sentences for example.

There are no people in the room.
There is one person in the room.
There are two people in the room.
There are 11 people in the room.

A simple format string might be:

"There are {0} people in the room."

But this, of course, does not handle the singular case very well.

There are 1 people in the room.

This format string seeks to handle plurality, but is awkward and unruly.

"There is/are {0} person/people in the room."

There is/are 1 person/people in the room.

Code can decide between multiple format strings based on checking the value of the number. For example,

Value is 1 use "There is one person in the room."
Otherwise use "There are {0} people in the room."

The results are probably acceptable in most languages.

There are 0 people in the room.
There is one person in the room.
There are 2 people in the room.
There are 3 people in the room.
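
A minimal sketch of that decision in JavaScript might look like this. The `peopleInRoom` function name is an assumption made for illustration; the two strings are the ones given above.

```javascript
// Sketch: choose a whole-sentence format string based on the count.
// `peopleInRoom` is a hypothetical helper; it covers the English rule only.
function peopleInRoom(count) {
    var template = (count === 1)
        ? "There is one person in the room."
        : "There are {0} people in the room.";
    return template.replace("{0}", String(count));
}

[0, 1, 2, 3].forEach(function (n) {
    console.log(peopleInRoom(n));
});
```

Each string is still a complete sentence, so the translator can handle both without seeing the code.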

In English, count is limited to singular (1) and plural (not 1). Other languages have more choices, for example, dual (2 as in a 'pair'). Some, such as Polish, are even more complex. Polish differs depending on whether the number ends in 2, 3, or 4 (with 12, 13, and 14 as exceptions). Even in English, the concept exists. For instance, we have seen how using the words "no", "one", "two", etc. is preferred to "0", "1", and "2" for small numbers. Another example is the "st", "nd", "rd", and "th" ending, as in:

1st, 2nd, 3rd, 4th, 5th, ..., 12th, ..., 22nd, and so on.

The rule to determine the ending is much more complex than just singular and plural. I know of no good way to handle this case without writing code that examines the count and selects the string accordingly.

case n = 0:
case n = 1:
case n = 2:
case n = 3:
case n mod 10 = 1 and n > 19 and n mod 100 <> 11:
case n mod 10 = 2 and n > 19 and n mod 100 <> 12:
case n mod 10 = 3 and n > 19 and n mod 100 <> 13:
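
For the English ordinal endings specifically, code along these lines could examine the number. This is only a sketch: `ordinalSuffix` is a hypothetical helper, it assumes a non-negative whole number, and it is of no use for other languages.

```javascript
// Sketch: pick the English ordinal ending by examining the number.
// `ordinalSuffix` is a hypothetical helper and applies to English only.
function ordinalSuffix(n) {
    var lastTwo = n % 100;
    // 11, 12, and 13 (also 111, 112, ...) take "th" despite ending in 1, 2, or 3.
    if (lastTwo >= 11 && lastTwo <= 13) {
        return "th";
    }
    switch (n % 10) {
        case 1: return "st";
        case 2: return "nd";
        case 3: return "rd";
        default: return "th";
    }
}

console.log([1, 2, 3, 4, 12, 22].map(function (n) {
    return n + ordinalSuffix(n);
}).join(", "));
```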

You will need to know the requirements for the locations where your content will be viewed and make the best compromises between cost and quality of translation.
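
To give a feel for the cost, here is a sketch of the selection logic that Polish, mentioned earlier, would need for whole numbers. The `polishPluralCategory` function and the category names "one", "few", and "many" are assumptions chosen for illustration; each category would map to its own translated sentence.

```javascript
// Sketch of the plural categories Polish needs for whole numbers.
// `polishPluralCategory` and the category names are illustrative assumptions.
function polishPluralCategory(n) {
    var lastDigit = n % 10;
    var lastTwo = n % 100;
    if (n === 1) {
        return "one";
    }
    // Numbers ending in 2, 3, or 4 form their own group, except 12, 13, and 14.
    if (lastDigit >= 2 && lastDigit <= 4 && !(lastTwo >= 12 && lastTwo <= 14)) {
        return "few";
    }
    return "many";
}

[1, 2, 5, 12, 22].forEach(function (n) {
    console.log(n + ": " + polishPluralCategory(n));
});
```

Modern JavaScript runtimes also expose `Intl.PluralRules`, which can supply categories like these where it is available, so hand-written rules may not be necessary.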

Instead of complex coding and multiple strings to translate, simply structure the sentence to avoid plurality.

"Number of people in the room: " + nNumPeople

Number of people in the room: 1


Use the Format function to avoid concatenating strings to form sentences or at least structure the sentence so that the variable follows the complete sentence.


Unicode and UTF-8 Encoding

(Unicode, Encoding)

One of the biggest problems I've found when talking about Unicode and character encoding is that it is so easy to misunderstand how encoding works. Joel Spolsky has written a decent blog post that covers the topic broadly without requiring you to read a 600-page book.

Additionally, several Ektron articles exist on the topic and cover some specific concepts as they relate to Ektron products.

Or you can search the Ektron Knowledge Base for "Unicode" for more articles.

When in doubt, use UTF-8 encoding. It supports all languages that can be represented by Unicode. UTF-8 is widely supported by browsers, databases and XML parsers. The Ektron CMS uses UTF-8. There should be no reason to use other older encodings such as Shift-JIS or ISO-8859.

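Remember that computer languages such as JavaScript, C#, and VB.NET all store strings in memory as Unicode, so strings held in memory do not need to be encoded. Encoding matters only when text is read from or written to a file, a database, or the network.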
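
The difference between a string in memory and its encoded bytes can be seen with a short JavaScript sketch. It assumes a runtime that provides the standard `TextEncoder` and `TextDecoder` objects (current browsers and Node.js do); the sample word is arbitrary.

```javascript
// Sketch: a JavaScript string holds Unicode text; bytes appear only on encoding.
// Assumes the standard TextEncoder/TextDecoder globals are available.
var text = "Espa\u00f1ol";                  // "Español", written with an escape
var bytes = new TextEncoder().encode(text); // TextEncoder always produces UTF-8

console.log(text.length);   // number of UTF-16 code units held in memory
console.log(bytes.length);  // number of bytes after encoding as UTF-8

// Decoding the bytes as UTF-8 gives back the original string.
var roundTrip = new TextDecoder("utf-8").decode(bytes);
console.log(roundTrip === text);
```

The string has seven characters, but the encoded form is one byte longer because UTF-8 uses two bytes for the ñ.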