UTF-8 introduction

Warning: This page is no longer in use. The information contained on the page should NOT be seen as relevant or reliable.

Characters, charsets and encodings

For historical reasons outside the scope of this text, people around the world communicate in different languages, and every language uses its own set of characters (charset). Although some characters are shared between languages, each language has its own graphemes or symbols, ranging from simple diacritics and special letters to completely different alphabets.

These differences between charsets have long been, and remain today, a big headache in our technological world, where electronic devices have to exchange information with other devices and with people.

Over the years, many initiatives (from governments, international organizations, private companies...) have tried to solve this problem, each developing its own way to represent characters electronically. Each set of pairings between a character (how it looks) and the way it is stored internally by electronic devices is commonly called a character encoding.
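As a minimal illustration of that pairing, the Python sketch below encodes one character under three encodings and shows that each one stores it as different bytes. The character and the encodings chosen (ISO-8859-1, the old IBM PC code page 437, and UTF-8) are assumptions made for this example and are not taken from this page.

```python
# One character, three encodings: each maps "é" to a different stored value.
# The character and the encodings below are illustrative choices only.
char = "é"

latin1_bytes = char.encode("iso-8859-1")  # one byte: 0xE9
cp437_bytes = char.encode("cp437")        # one byte: 0x82
utf8_bytes = char.encode("utf-8")         # two bytes: 0xC3 0xA9

for name, data in [("iso-8859-1", latin1_bytes),
                   ("cp437", cp437_bytes),
                   ("utf-8", utf8_bytes)]:
    print(f"{name:12} {data.hex(' ')}")
```

The character looks the same on screen in all three cases; only the stored bytes differ, which is exactly what an encoding defines.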

Some of these encodings are widely used today by computers and all sorts of devices, support more than one language, and have been adopted by the international community. For example, the ISO-8859-1 encoding, also called ISO-Latin1, is shared by many Latin-based languages: English, French, Spanish... You can see a good list here.

But practically all these encodings share some shortcomings:

They encode a limited range of characters: Many of them use only one byte to store each character. While this is good in terms of space, it restricts the number of characters that can be encoded (at most 256 different values).

They are not compatible with each other: Different encodings use the same stored value to represent different characters. This forces applications that work with several encodings at the same time to perform many conversions and checks to know which encoding is in use at each point.

They are exclusive: Although some of them support more than one language, none of them is universal; each serves a limited audience (some are used by millions of people, but that isn't enough!).

There are too many of them: If you look at the list above and investigate some of the encodings, it's easy to see that there is more than one encoding for the same language (see the Windows-XXXX encodings, which are close variants, in some cases supersets, of the ISO-XXXX encodings, or the different Japanese alternatives available...).
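The first two shortcomings, and the Windows/ISO overlap mentioned in the last one, can be observed directly. The following Python sketch is only an illustration under assumed inputs: the byte values, the sample strings and the helper `can_encode` are hypothetical choices made for this example.

```python
# Incompatibility: the same stored byte means a different character
# depending on which single-byte encoding is assumed.
raw = b"\xe9"
readings = {name: raw.decode(name)
            for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}
print(readings)  # Latin "é", Cyrillic "щ", Greek "ι"


# Limited range: a one-byte encoding has at most 256 slots, so text
# outside its repertoire simply cannot be stored.
def can_encode(text, encoding):
    """Hypothetical helper: True if every character fits the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("français", "iso-8859-1"))  # True
print(can_encode("日本語", "iso-8859-1"))      # False

# Too many near-duplicates: Windows-1252 and ISO-8859-1 agree on most
# bytes but not on 0x80, which is the euro sign only in Windows-1252.
euro_cp1252 = b"\x80".decode("cp1252")
latin1_0x80 = b"\x80".decode("iso-8859-1")  # a control character instead
print(repr(euro_cp1252), repr(latin1_0x80))
```

An application that receives the byte 0xE9 without knowing the encoding has no reliable way to tell which of the three characters was meant, which is why the conversions and checks described above become necessary.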

Focusing on the computer world, all these problems are very real, from both the end user's and the developer's perspective. Every component of any current application (database, programming language, keyboard...) has a different level of support for each encoding.

As you can see, this is a nightmare! And the solution seems to be, after a quick thought, really simple: Let's use one encoding big enough to represent any character in any language and use it everywhere! Easy!

With it, every piece of textual information (a book, a phrase...) is stored, displayed, printed and transmitted following common rules (the encoding itself), with no need to continuously translate (re-encode) that information. Something like a worldwide Esperanto, but for electronic devices.

Does it exist? Sure, it's Unicode!
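To make the idea concrete, here is a small Python sketch. Unicode assigns every character a number (a code point), and UTF-8 is one of the encodings that stores those numbers as bytes. The sample strings are assumptions chosen for illustration; the point is only that a single encoding handles all of them and gives back the original text unchanged.

```python
# Text in several scripts, all handled by one encoding (UTF-8).
# The sample strings are illustrative, not taken from this page.
samples = ["English", "français", "español", "Ελληνικά", "русский", "日本語"]

byte_lengths = {}
for text in samples:
    encoded = text.encode("utf-8")
    # Decoding with the same encoding restores the original text exactly.
    assert encoded.decode("utf-8") == text
    byte_lengths[text] = len(encoded)
    print(f"{text}: {len(text)} characters, {len(encoded)} bytes")

# Every character has one Unicode code point, whatever its script.
for ch in "aé日":
    print(f"{ch!r} is U+{ord(ch):04X}")
```

Note that UTF-8 uses a variable number of bytes per character: plain ASCII letters still take one byte each, while characters from other scripts take two or more.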
