
# Summary

  • You SHOULD use Unicode internally, and preferably UTF-8 when exchanging text with others.
  • Don't mix encodings: talk with your team to use the same encoding everywhere.
  • Explicitly specify encoding: configure your servers, edit your files (HTML/XML/...), etc.
  • More:
    • Write documentation to specify the encoding of your source files and outputs.
    • Read documentation/RFCs to know how to treat input texts and which encoding to use for outputs.
  • It's easier when using a Unicode-friendly programming language or good libraries such as Qt or the full-featured ICU library.

# Introduction

Text encoding is seen by many developers as something that should work automagically. We just want to manipulate text, that's all!
When it doesn't, we often add a line about UTF-8 or ISO-8859-1 and hope it will work. Then, when something becomes... broken, we start to read more about encodings.

# Vocabulary

A character set is a table where numbers called code points are associated with characters: Unicode is a character set where the code point 233 refers to the character LATIN SMALL LETTER E WITH ACUTE.
A character encoding is a method to store these code points using bits: UTF-32 is a character encoding for Unicode that uses 32 bits per code point.

For example, an 8-bit encoded ASCII character 0b01100101 refers to the code point 101 in the ASCII character set: LATIN SMALL LETTER E.
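
To make the distinction more concrete, here is a minimal C++ sketch showing that a single Unicode code point, U+00E9 (é, code point 233), is stored as different bytes by different encodings:

```cpp
// Minimal sketch: the same code point (U+00E9, é) has different byte
// representations depending on the character encoding used to store it.
#include <cstdio>

int main() {
    const unsigned char latin1[] = {0xE9};        // é in ISO-8859-1: one byte
    const unsigned char utf8[]   = {0xC3, 0xA9};  // é in UTF-8: two bytes

    std::printf("code point: 233 (U+00E9)\n");
    std::printf("ISO-8859-1 bytes: %02X\n", latin1[0]);
    std::printf("UTF-8 bytes:      %02X %02X\n", utf8[0], utf8[1]);
    return 0;
}
```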

For technical reasons, character sets having at most 256 code points are usually encoded using 8 bits (1 byte) per code point.
Thus, the same term is often used to refer both to the character set and to the character encoding: this is the case for ASCII.

That's why many people don't make the distinction and simply talk about charsets.

# History and Unicode

In 1963, ASCII appeared with 128 code points to represent English characters. It became the default character encoding on many systems.
Later, other parts of the world created their own character sets to handle custom characters.
In 1985, ISO-8859-1 appeared to extend the 128 ASCII code points with 128 additional code points to represent Western Europe characters.
As it was compatible with ASCII, ISO-8859-1 became the default character encoding on many systems (replacing ASCII).

Given the many character sets in use and the incompatibility issues between them, Unicode appeared in 1991 to unify them all.

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. (Wikipedia)

Unicode provides:

  • A character set that aims to represent the most used characters in the world, including mathematical symbols (1 114 112 possible code points, which fit in 21 bits).
  • Ways of combining characters together (for example, you can represent the character é as LATIN SMALL LETTER E WITH ACUTE, or as LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT; see the sketch after this list).
  • Rules for sorting, comparing, normalizing and transforming texts (for example removing accents, capitalizing letters, etc.).
  • Character encodings to store these code points.
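
As a quick illustration of combining characters, here is a small C++11 sketch (assuming a UTF-8 terminal) printing the two representations of é mentioned above; they render the same but contain different bytes, which is exactly why normalization rules exist:

```cpp
// The precomposed form (U+00E9) and the decomposed form (U+0065 + U+0301,
// COMBINING ACUTE ACCENT) both display as é but differ byte-wise.
#include <cstdio>
#include <cstring>

int main() {
    const char *precomposed = u8"\u00E9";   // one code point  -> 2 UTF-8 bytes
    const char *decomposed  = u8"e\u0301";  // two code points -> 3 UTF-8 bytes

    std::printf("precomposed: %s (%zu bytes)\n", precomposed, std::strlen(precomposed));
    std::printf("decomposed:  %s (%zu bytes)\n", decomposed, std::strlen(decomposed));
    return 0;
}
```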

The three common character encodings for Unicode are:

  • UTF-8: stores a Unicode code point using one to four 8-bit code units.
  • UTF-16: stores a Unicode code point using one or two 16-bit code units.
  • UTF-32: stores a Unicode code point using a single 32-bit code unit.
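
Here is a small C++11 sketch illustrating those code-unit counts with string literals, for one code point inside the Basic Multilingual Plane (é) and one outside it (😀, U+1F600):

```cpp
// Compare how many code units each Unicode encoding needs per code point.
// sizeof(...) - 1 drops the terminating null of each string literal.
#include <cstdio>

int main() {
    std::printf("U+00E9 : UTF-8 %zu bytes, UTF-16 %zu unit(s), UTF-32 %zu unit\n",
                sizeof(u8"\u00E9") - 1,                        // 2 bytes
                sizeof(u"\u00E9") / sizeof(char16_t) - 1,      // 1 code unit
                sizeof(U"\u00E9") / sizeof(char32_t) - 1);     // 1 code unit
    std::printf("U+1F600: UTF-8 %zu bytes, UTF-16 %zu unit(s), UTF-32 %zu unit\n",
                sizeof(u8"\U0001F600") - 1,                    // 4 bytes
                sizeof(u"\U0001F600") / sizeof(char16_t) - 1,  // 2 code units (surrogate pair)
                sizeof(U"\U0001F600") / sizeof(char32_t) - 1); // 1 code unit
    return 0;
}
```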

As of 2012, UTF-8 is the default character encoding on many recent Linux distributions.

# Is UTF-8 the way to go?

When mixing character sets or character encodings, it becomes hard for developers and users to avoid issues.

The easiest way to solve most issues is to agree to use the same character encoding.

UTF-8 can already be seen as the de facto default character encoding.
Although it is compatible with ASCII, it is not compatible with ISO-8859-1, which slows down the global migration to UTF-8.
However, it is now the default character encoding in many places (Linux, CSS, XML, SIP, ...), and as of March 2012, 69% of websites use UTF-8.

Moreover, many languages and libraries are evolving to provide better support for Unicode (C11, C++11, Python 3, ...).

# What can you do?

Use UTF-8 whenever you can, and be consistent:

  • Configure your systems to use UTF-8 locales.
  • Use UTF-8 to encode your files.
  • When possible, add a header to your files to specify their encoding (HTML, CSS, XML, ...); see the examples after this list.
  • Talk with your team so that everyone uses the same encoding.
  • If you provide an API, document which encoding to use for texts.
  • Configure your servers to state their encoding explicitly (for web servers, this is the Content-Type header).
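
For reference, the usual per-format declarations look like this (shown here for UTF-8):

```
HTML:  <meta charset="utf-8">
XML:   <?xml version="1.0" encoding="UTF-8"?>
CSS:   @charset "UTF-8";
HTTP:  Content-Type: text/html; charset=utf-8
```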

However, be careful: there is no reliable way to know which encoding a text file uses. If your system was previously using a non-ASCII encoding, switching to UTF-8 can lead to encoding issues: I recommend starting with a fresh Linux install and/or using iconv to convert your files.
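
For example, converting a single file from ISO-8859-1 to UTF-8 with the iconv command-line tool looks like this (the file names are placeholders):

```
iconv -f ISO-8859-1 -t UTF-8 old.txt > new.txt
```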

# What else?

Well, you might still encounter some issues with files and APIs containing non-ASCII characters and using another encoding.
However, this problem can be avoided with file types that allow specifying their encoding (HTML, CSS, XML, ...).

Otherwise, you will have to convert texts from the target encoding to your encoding (and vice versa) to fix the problem. To know which target encoding to use, you will have to:

  • Communicate with the file or API provider to ask its encoding (read the documentation).
  • Or heuristically guess the encoding (it's like guessing the language of a sentence - try ISO-8859-1 first ;)).

Then, you can use iconv (as a command-line tool or as a C library) or other languages and libraries to convert between encodings during input and output.

For example, on Unix you can find the encoding expected by the console and by text files by looking at the locale (man 3 nl_langinfo).
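
Here is a minimal C++ sketch (POSIX) that queries the current locale's encoding with nl_langinfo:

```cpp
// Print the character encoding of the current locale (e.g. "UTF-8").
#include <clocale>
#include <cstdio>
#include <langinfo.h>

int main() {
    std::setlocale(LC_ALL, "");                 // use the locale from the environment
    std::printf("%s\n", nl_langinfo(CODESET));  // e.g. "UTF-8" or "ISO-8859-1"
    return 0;
}
```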

You also have to look at how your programming language deals with encodings: some are encoding-agnostic and return exactly what you give them, while others convert your texts to a specific encoding.

# That's all?

It depends on your needs.

If you only want to handle all the symbols of the world: yes, that's all.

But if you really want perfectly internationalized software, you'll still have to think about details:

  • How to display dates in a familiar way to users?
  • How to sort texts depending on contexts (language & place)?
  • How to transliterate texts from one script into another?
  • How to simplify searches involving non-ASCII characters (like treating e as equal to é and ê)?
  • ...

The Unicode Consortium has defined rules to help with these tasks.
Using libraries or languages that implement these rules, such as the ICU library, will help you.

# Good libraries for C/C++

For basic encoding conversions, you can look at the iconv library (man 3 iconv).
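
A minimal sketch of the C API (on Linux/glibc), converting a short ISO-8859-1 string to UTF-8:

```cpp
// Convert "café" from ISO-8859-1 to UTF-8 with iconv_open/iconv/iconv_close.
#include <iconv.h>
#include <cstdio>
#include <cstring>

int main() {
    char input[] = "caf\xE9";   // "café" encoded in ISO-8859-1 (é = 0xE9)
    char output[16] = {0};

    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");  // to, from
    if (cd == (iconv_t)-1) { std::perror("iconv_open"); return 1; }

    char *in = input, *out = output;
    size_t in_left = std::strlen(input), out_left = sizeof(output) - 1;
    if (iconv(cd, &in, &in_left, &out, &out_left) == (size_t)-1) {
        std::perror("iconv");
        return 1;
    }
    std::printf("%s\n", output);   // prints "café" as UTF-8
    iconv_close(cd);
    return 0;
}
```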

For easy conversion and manipulation of Unicode text, the Qt library is really pleasant to use. Note that Qt 4 uses Unicode internally (QString stores UTF-16), but treats plain C strings as ISO-8859-1 by default. You can change this behavior and use UTF-8 by default with QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8")) and QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8")).
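
As a sketch, the setup mentioned above plus an explicit conversion (Qt 4, source file saved as UTF-8):

```cpp
// Qt 4: interpret plain C strings and tr() literals as UTF-8 instead of
// the ISO-8859-1 default, then build a QString from UTF-8 input.
#include <QTextCodec>
#include <QString>
#include <QDebug>

int main() {
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
    QTextCodec::setCodecForTr(QTextCodec::codecForName("UTF-8"));

    QString s = QString::fromUtf8("café");  // explicit conversion from UTF-8
    qDebug() << s << s.length();            // 4 characters (stored as UTF-16 internally)
    return 0;
}
```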

For advanced internationalization, the ICU library seems to be the way to go (it also exists for Java).
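
For instance, here is a minimal ICU sketch (assuming the ICU development packages and a UTF-8 source file) that removes accents, one of the text transformations listed earlier:

```cpp
// Remove accents with ICU: decompose (NFD), drop combining marks, recompose (NFC).
#include <unicode/translit.h>
#include <unicode/unistr.h>
#include <cstdio>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    icu::Transliterator *removeAccents = icu::Transliterator::createInstance(
        "NFD; [:Nonspacing Mark:] Remove; NFC", UTRANS_FORWARD, status);
    if (U_FAILURE(status)) return 1;

    icu::UnicodeString text = icu::UnicodeString::fromUTF8("déjà vu");
    removeAccents->transliterate(text);   // "déjà vu" -> "deja vu"

    std::string utf8;
    text.toUTF8String(utf8);
    std::printf("%s\n", utf8.c_str());    // prints "deja vu"
    delete removeAccents;
    return 0;
}
```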

# Conclusion

As you can see, mixing encodings leads to issues: work with your team to use the same character encoding and write documentation explicitly mentioning which encoding you use.
As everybody migrates to UTF-8, the encoding issues will gradually decrease.

For more information, you can refer to the Unicode FAQ and to the ICU User Guide.

Feel free to post any suggestions/corrections/questions!
