UTF-8 current: Difference between revisions

Revision as of 21:30, 25 January 2006


Current situation

Up to the 1.5.x series, Moodle has been able to run in many different languages. Each one is provided as a "Language Pack" containing translations of all the strings and help files.

Each of these languages defines the character encoding to be used when a page is sent to the browser. The same encoding is used when the user submits form data back to Moodle. Finally, that data is processed by Moodle and sent to the database backend.

As the language can be determined and/or forced by Moodle at user level, course level and site level, the application knows, at any time, which encoding to declare in the web page sent to the browser.
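The round trip described above can be sketched in a few lines. This is only an illustration, written in Python rather than Moodle's PHP. The `LANGUAGE_PACK_CHARSET` table and both function names are hypothetical, and the charsets listed are plausible examples, not values taken from real language packs.

```python
from typing import Tuple

# Hypothetical mapping from language code to the charset its pack declares.
# The entries are illustrative assumptions, not taken from real Moodle packs.
LANGUAGE_PACK_CHARSET = {
    "en": "iso-8859-1",
    "es": "iso-8859-1",
    "el": "iso-8859-7",
}


def render_page(lang: str, text: str) -> Tuple[str, bytes]:
    """Return the charset to declare for the page and the encoded page body."""
    charset = LANGUAGE_PACK_CHARSET[lang]
    return charset, text.encode(charset)


def read_form_submission(charset: str, raw: bytes) -> str:
    """Decode submitted form bytes using the charset the page was served in."""
    return raw.decode(charset)


charset, body = render_page("el", "ΜΑΘΗΜΑ")
# Form data comes back in the page's charset, so decoding with that same
# charset recovers the original text.
assert read_form_submission(charset, body) == "ΜΑΘΗΜΑ"
# Decoding with a different single-byte charset silently yields wrong characters.
assert read_form_submission("iso-8859-1", body) != "ΜΑΘΗΜΑ"
```

The last assertion shows why every layer has to agree on the encoding: the bytes decode without error under the wrong charset, but the text is no longer the same.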

Although this scheme has worked reasonably well for some time, it has some important limitations that need to be solved:

Multilingual Sites (or, more properly, multi-encoding sites): As we have seen above, each web page has one, and only one, character encoding. Because traditional encodings are "exclusive" (none of them supports all the characters in the world), it is impossible to mix content from different encodings unless we use one universal encoding, such as UTF-8.

PHP's Difficulties: In the past, PHP has offered good support for different encodings, especially those that use only one byte per character. Support for multibyte encodings has been less than perfect, and some commonly used functions don't work with those encodings. Trying to add support for every encoding currently in use could be really difficult because each one has its own characteristics. So, once more, we need to settle on a single multibyte encoding. UTF-8 can be it.

Lack of Collation: One important characteristic of every language is its collation rules. Basically, such rules define how to sort strings alphabetically. They range from the simplest ASCII ordering to complex norms where uppercase letters differ from lowercase ones (in terms of graphemes or sorting) or where a group of characters has a specific place when they appear together. All these strategies must be supported by the database, and recent versions of both MySQL and PostgreSQL have some powerful features in this area. Moreover, general-purpose collations are available for UTF-8 data that can be used by default everywhere, and we can be more precise at DB level by using specialized collations too.

Exchange of Information: Moodle has a lot of features related to this. From loading users and enrolments, and importing and exporting glossaries and question pools, to backing up and restoring courses, a lot of files are loaded and saved when working with Moodle. Generating all those files in different encodings can be really hard to maintain and, worse, breaks compatibility between files generated in exclusive encodings. Again, we need to homogenize how all those files are handled. The solution? UTF-8, yep!

Adoption of Standards: With the arrival of a plethora of new standards (XHTML, SCORM, IMS Content Package, IMS Learning Design, SOAP, XML-RPC...), one way of working is becoming more important every day: XML files. All the modern specifications for storing and sending information use this markup language internally. And, although others are supported, one encoding is the king of these documents. Can you guess which one? Of course, UTF-8, once more.
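Two of the limitations listed above are easy to reproduce. The sketch below uses Python purely for illustration (Moodle itself is written in PHP). It shows that no single legacy single-byte encoding can hold text that mixes two scripts, while UTF-8 can, and that counting bytes, as byte-oriented string functions do, gives the wrong length for multibyte text. The sample string and the helper name `can_encode` are assumptions made for the example.

```python
def can_encode(text: str, charset: str) -> bool:
    """Report whether every character of `text` exists in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# A page mixing Spanish and Greek content.
mixed = "año ΜΑΘΗΜΑ"

assert not can_encode(mixed, "iso-8859-1")  # Latin-1 has no Greek letters
assert not can_encode(mixed, "iso-8859-7")  # the Greek charset has no "ñ"
assert can_encode(mixed, "utf-8")           # UTF-8 covers both

# Byte-oriented functions miscount multibyte text: the byte count of the
# UTF-8 form is larger than the number of characters.
utf8_bytes = mixed.encode("utf-8")
assert len(utf8_bytes) > len(mixed)
```

The second part is the reason string functions need to be aware of the encoding: once a character can take more than one byte, lengths and offsets measured in bytes no longer match what the user sees.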

All these problems have been more obvious for people using encodings other than ISO-8859-1, which is shared by many languages written in the Latin script (English, French, Spanish, German, Italian...), mainly because the web technology used (browsers), the programming environment (PHP) and the database backend (MySQL, PostgreSQL...) all had very limited support for the other encodings.

But nowadays, things have changed a lot. Browsers that support different encodings, PHP alternatives such as mbstring and iconv, and database improvements make it practicable to tackle the problems above with a high chance of success.
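As an illustration of what such conversion tools make possible, the sketch below transcodes text stored in a legacy encoding into UTF-8. It is written in Python to stay self-contained; in PHP the same step would typically go through the iconv or mbstring extensions. The function name `to_utf8` and the sample data are assumptions, not part of Moodle.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Transcode bytes stored in `source_charset` into UTF-8 bytes."""
    return raw.decode(source_charset).encode("utf-8")


# Text as an older ISO-8859-1 site might have stored it.
legacy = "año".encode("iso-8859-1")

converted = to_utf8(legacy, "iso-8859-1")
assert converted.decode("utf-8") == "año"
# "ñ" takes one byte in ISO-8859-1 but two in UTF-8, so the data grows.
assert len(converted) == len(legacy) + 1
```

Note that the source charset has to be known in advance: the bytes alone do not reveal which encoding produced them.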

And, here, the proposed solution.
