Note: You are currently viewing documentation for Moodle 2.3. Up-to-date documentation for the latest stable version is available here: UTF-8 scripts.

Development:UTF-8 scripts: Difference between revisions

From MoodleDocs
m (cat edit)
(53 intermediate revisions by 5 users not shown)
Line 1: Line 1:
[[UTF-8 migration]] > Recoding PHP scripts
''This page is under construction!!''
Only some preliminary ideas have been defined.

==Recoding PHP scripts==
Line 17: Line 13:

For each modification we'll show if it's executed always or conditionally, using the "'''A'''" and "'''C'''" abbreviation. Also, the current status will be maintained with "'''N'''"ot implemented, "'''W'''"ork in progress and "'''D'''"one. Obviously, "'''D'''" is the desired final status for all the modifications.
Finally, you are welcome to add new items to the list, but please add them at the end so the current numbering is preserved.

=== The List ===

Build one "check for 1.6 upgrade" utility under 1.5! (it should check for software present and lang packs used).
# (D,A) Build a "check for 1.6 upgrade" utility under 1.5. It is now available from the admin page. It can check the DB, PHP and PHP libraries, allows function execution, and is built on the new [ environmentlib.php] script. All the checks are defined in one simple XML file, and a mechanism to update it has been provided (using the new [ componentlib.class.php] library).
# <span style="color:red">(N, C)</span> POSTPONED: datalib.php to support collations under MySQL. This will allow language-specific ordering to be controlled from Moodle, but 1.6.0 won't offer support for it. Instead, MySQL table/field collation can be altered "manually" if the default Unicode collation isn't enough.
datalib.php to support collations under MySQL
# (D,A) textlib.class to handle all those UTF-compliant functions. It has been working since 1.5.3. Built as a wrapper around the [ Typo3] text handling libraries, it offers conversion between charsets and a bunch of functions such as substr, strtoupper, strpos...
# (D,C) XML import/export (scorm, ims, backup/restore, glossary, quizzes...). Under UTF-8 mode, both utf8_encode() and utf8_decode() won't be needed anymore.
textlib.class to handle all those utf-compliant functions
# (D,A) Excel Export: A new [ excellib.class.php] class wrapper has been built. It works with some [ PEAR] libraries to be able to create UTF-16LE Excel files properly.
# (D,A) Modify every Excel generator to use the new library. This includes: grade/lib.php, course/grades.php, choice/report.php, hotpot/report/default.php, quiz/report/analysis.php, quiz/report/overview.php, quiz/report/ and survey/download.php
XML import/export (scorm, ims, backup/restore, glossary, quizzes...)
# (D,A) Fixed the break_up_long_words() to work using the textlib functions.
# <span style="color:red">(W,A)</span> Fix the glossary so that it properly finds the pivot (initial letter) under UTF-8 (Skodak has some ideas here about the alphabet for each language). Also, for ideogram-based languages, Moodle HQ is planning to do some work (after 1.6). [ Bug 6125]
excel export
# (D,A) Modify the rss_title() function to support UTF8 chars.
# <span style="color:red">(W,A)</span> htmlentities() to s() migration everywhere. [ Bug 6121]
htmlentities() ---> s() migration 1.5 and 1.6
# <span style="color:red">(W,A)</span> Change uses of substr, strlen, strpos, strtoupper... to the new textlib class, which offers UTF-8 savvy string manipulation functions. Some other functions like moodle_strtolower() will be modified to use the new text library (done!). They will disappear after 1.6 (do it only in 1.7dev). [ Bug 6122]
# (D,A) Modify documentation to let users know how they MUST create their DB before installing Moodle 1.6, explaining all the benefits of being UTF-8 enabled from the beginning.
uses of substr, strlen, strpos, regexp (both posix and perl), htmlspecialchars and htmlentities.
# (D,A) Modify the central installation script to:
#* Check DB encoding, warning if unicode hasn't been detected.
potentially filters...
#* Execute the environmental checks.
# (D,A) Modify the Windows32 Complete Package installation script to:
modify documentation to let users know how they MUST create their DB before installing Moodle.
#* Force DB creation under UTF-8.
#* Execute the environmental checks.
Modify creation scripts to use the UTF-8 encoding? Perhaps not necessary if DB has it defined as default? Test it.
# (D,A) The wiki module - this is due to the use of htmlentities() without specifying the character set. DFWiki does not have any known problems apparently.
# (D,A) UTF-8 national chars could not be used in paths/foldernames. Unicode characters in filenames can now be enabled by setting $CFG->unicodecleanfilename=true in config.php, though this option is not recommended. Please note that unicode characters in filenames may be broken during zip/unzip process, native info-zip binaries do not work at all on Windows; please use internal zipping/unzipping.
1) The wiki module - this is due to the use of htmlentities() without specifying the character set. DFWiki apparently does not have any known problems. Solved! 1.1) Pressing the edit button leads to a blank page, perhaps because double-byte characters do not work in the URL (see the source of the below)
# (D,A) GD support for UTF-8 strings. Perhaps it'll require some hacking + new fonts to be added (centrally or inside each lang pack). Basically, graphlib already supported UTF-8 fonts without any problems. We've updated the central TrueType font, which now supports Latin, Cyrillic and Greek. Some old, unneeded fonts have been deleted. Languages not covered by it will need to install their own fonts (under moodledata/lang/xx_utf8/fonts). (Translated from Japanese: "This wiki garbles characters, doesn't it.")
# (D,C) RSS block. Working fine now with the new textlib.class.php library.
# (D,A) Language list issues: lang names not showing properly. No action required because all those names are now UTF-8 and should display properly. I think it's a browser/font issue.
2) Japanese can't be used in paths/foldernames. This is not a Moodle problem but simply the limitations of the server I presume.
# (D,C) The Assignment module reports 0 words for languages without word separators. A new configuration option has been added to count letters instead of words.
# (D,C) Modify the email_to_user() function to enable encoding of mails based on site setting and/or user preference. This is a must because a lot of mail clients/widgets don't support UTF-8 encodings. Done (email encoding can now be specified at site and user level, plus support for any charset (header + body)).
3) I have not tested GD and Japanese fonts but I guess that this may have problems as well.
# <span style="color:red">(N,C)</span> Look for all the occurrences of "en" or 'en' in source code and change them to 'en_utf8' if necessary. [ Bug 6124]
# (D,A) New lang edition interface for utf_8 lang packs. It should support dataroot lang packs, use 'en_utf8' as master language and "lock" the new langconfig.php file.
4) I have found what I think is a small bug. When I backed up and restored the course (choosing to add data to this course, and only restoring the quiz) I found that the quiz description had the remains of some HTML tags added at the beginning and end.
# (D,C) DB migration with these features:
#* Convert all the users/courses/site languages to their new alternative.
5) Also the News feed of the Asahi Newspaper ( - Japan's most famous newspaper, perhaps) is garbling. The reason is the use of break_up_long_words(). The current fix
#* Recover from crash smoothly.
#* Handle central tables and official modules. Contrib modules should implement their own script. See [[UTF-8 contrib]] for more info.
will not work since all languages are now utf8 languages.
# <strike><span style="color:red">(N,C)</span> Modify the document_file() function to work properly (links to If documentation isn't going to be included with Moodle anymore, such function will disappear completely in a near future.</strike>
Over at the Japanese forums there are two suggestions
# (D,A) Modify the footer output to allow it to go to the proper (en, es...) wiki page in A list of available languages can be harcoded, defaulting to en.
# (D,C) Review the RSS feeds creator to detect if conversion to UTF-8 is needed.
One is to carefully calculate the position of inserted space so that it is inserted between multibyte characters. The final suggestion from Prof Nakayama is at
# (D,C) Analyse if we need the $SESSION->encoding. Removed from main CVS; 3rd party plugins must be updated anyway to use current_charset() and $CFG->unicodedb to be fully utf8 and 1.6 compatible.
# (D,A) Delete all the get/print_string('thisencoding') calls and change them to the new current_charset() function. Seems finished.
However, this relies on the use of mbstring php extensions. Multibyte languages (using Kanji at least, but perhaps not Korean) can be wrapped onto the next line at any point, so if mbstring extensions are used, then, as prof Kariya suggests it is usually safe simply NOT to add any spaces into the string at all, so he recommends
# (D,C) Analyse and, if possible, implement a bit improved moodle_setlocale() function because of [[Table of locales|differences between Unix locales and Win32 locales]]. They should go to a new string inside each langconfig.php file and, after OS detection use the correct one. Done, 95% of Windows langs will work.
if(extension_loaded('mbstring')) return $string;
# (D,C) Implement one PEAR download utility if finally we cannot add it to standard distro (it's ready since some time ago under Needed to generate Excel files and, potentially, it'll grow. Finally we got perms from Xavier Noguer (see lib/pear/README.txt for a note about the PHP license and Moodle) and it has been included. Anyway, the pear download continues being generated in daily.
# (D,C) Handling of passwords. After DB conversion, password hashes are updated to utf8 during the next user login. First the hash of the Unicode plain text password is tried; then the plain text is converted into the 'oldcharset' defined in the new language pack and its hash is checked again.
# (D,A) Modify (and potentially upgrade) the MyPHPAdmin module in order to recognise new lang pack names. Done: just support to utf8 langs added.
6) The language list drop down menu does not display properly. I am not sure if it should, and I don't really care (since I normally limit my site to a few languages) but it looks like this
# <span style="color:red">(N,C)</span> POSTPONED: Add one new parameter (xxx_original_encoding) to authentication methods in order to be able to convert from external sources (ldap, db...) encoding to utf8 if Moodle is running in that mode avoiding the current utf8_decode() implemented. [ Bug 6123]
# (D,A) Modify the install.php script to be able to detect DB encoding and warn about it plus use the new environment stuff to perform tests.
# (D,A) Create a collection of '''installer.php''' (to avoid conflicts with old install.php) files to be stored under the install/lang directory and to be used exclusively in the installation process. Hack get_string() to support this files ONLY in installation and make a script to be able to build them daily from contents existing in other language files (i.e, no manual handling of them!). Also, perhaps, add the possibility of language download at the end of the installation script.
7) The Assignment module says that 0 words have been submitted after a Japanese language submission, because there are no spaces in Japanese.
==De-UTF8-ing for client side applications==
One of the biggest problems which remains after the move to UTF-8 is that while UTF-8 is great on the web a lot of client side software in Japan and China is *NOT* UTF-8 compatible.
The Japanese community's solution to this (provided originally by Mr. Kashiwagi, below) is:
1) At the interface between Moodle and client side software (particularly email clients and spreadsheet programs) Moodle checks to see if there is a /lib folder in the current language.
2) If a lang/xyz/lib folder is present, then the routines in that folder are used to convert the encoding to formats compatible with client side software. (E.g. even Outlook Express is not compatible with UTF8 in the subject line, and Excel cannot deal with UTF8 either, so the lang/lib/ files contain code to convert the UTF8 to a client readable format.) Contact with client side software occurs at the following points.
2.1) Email sent to email clients
2.2) Grade files exported to Excel
2.3) Quizzes and lessons imported from text editors
3) If the lang/xyz/lib folder is not present, then UTF8 encoding is used to talk to the client software as normal.
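
The three steps above could be sketched roughly as follows. This is only an illustration of the folder check, not the Japanese community's actual patch: the function name sketch_client_encoding(), its parameters and the legacy charset passed in are all hypothetical.

```php
<?php
// Hypothetical sketch: pick the encoding used when talking to client side software.
// $langroot      - directory that holds the language packs (assumed layout: <root>/<lang>/lib)
// $lang          - current language code
// $legacycharset - charset the client software understands (an assumption of this sketch)
function sketch_client_encoding($langroot, $lang, $legacycharset) {
    $libdir = $langroot . '/' . $lang . '/lib';
    if (is_dir($libdir)) {
        // Steps 1 and 2: a lib folder exists, so output should be converted for the client.
        return $legacycharset;
    }
    // Step 3: no lib folder, keep talking UTF-8 as normal.
    return 'UTF-8';
}

// With a language pack that has no lib folder, the sketch falls back to UTF-8.
echo sketch_client_encoding('/nonexistent/lang', 'ja', 'Shift_JIS'), "\n";
```

The actual conversion routines would live inside the lib folder itself; the sketch only shows the decision of whether to call them.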

== Patches reference ==

Post here all the solutions you know if you consider that they'll be interesting to solve some of the pending items in the list above.
Patches that allow Japanese-language Moodles to function without garbling have been prepared and are available at the following sites.

Line 95: Line 81:
Many, or even most end-users (including myself = Tim) are not sufficiently confident making extensive patches, so our Moodles have been garbling in important areas (email/excel).

One interesting PHP-UTF8 reference:

1) Solved - tests in progress.
Great reference for Unicode Fonts: and

5) Has been solved by the use of a new multi-byte character compatible break_up_long_words().
For item #8: may be for letter-index use "distinct" first letters from existing data, and not lang alphabet? it can help in multilanguage glossary... --[[User:Ne Nashev|Ne Nashev]] 13:53, 29 March 2006 (WST)


Latest revision as of 10:17, 18 June 2007

UTF-8 migration > Recoding PHP scripts

Recoding PHP scripts


This page lists well-known UTF-8 related modifications to be applied to 1.6 so that it works better under Unicode. Some of the changes can always be applied, while others would break compatibility with non UTF-8 sites, so they have to be executed conditionally, for example:

     if (!empty($CFG->unicodedb)) {
         // Code to be executed in UTF-8 mode
     } else {
         // Old code
     }

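As a minimal sketch of how such a conditional change might look in practice, the example below wraps a string-length call. Only $CFG->unicodedb comes from this page; the helper names sketch_utf8_strlen() and sketch_strlen() are hypothetical, and the character counting is done by hand rather than through textlib so that the example has no dependencies.

```php
<?php
// Hypothetical helper, not part of Moodle: counts characters instead of bytes
// by skipping UTF-8 continuation bytes (those of the form 10xxxxxx).
function sketch_utf8_strlen($text) {
    $count = 0;
    $bytes = strlen($text);
    for ($i = 0; $i < $bytes; $i++) {
        if ((ord($text[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// Hypothetical wrapper that follows the conditional pattern shown above.
function sketch_strlen($text, $cfg) {
    if (!empty($cfg->unicodedb)) {
        return sketch_utf8_strlen($text); // UTF-8 mode: count characters
    } else {
        return strlen($text);             // old code: count bytes
    }
}

$cfg = new stdClass();
$cfg->unicodedb = true;
$word = "caf\xC3\xA9"; // "café" in UTF-8: 5 bytes, 4 characters
echo sketch_strlen($word, $cfg), "\n";
```

In UTF-8 mode the wrapper reports 4 for this string, whereas the byte-oriented branch reports 5, which is the kind of difference the conditional changes in the list are meant to handle.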
For each modification we'll show whether it's executed always or conditionally, using the "A" and "C" abbreviations. The current status is tracked with "N"ot implemented, "W"ork in progress and "D"one. Obviously, "D" is the desired final status for all the modifications.

Finally, you are welcome to add new items to the list, but please add them at the end so the current numbering is preserved.

The List

  1. (D,A) Build a "check for 1.6 upgrade" utility under 1.5. It is now available from the admin page. It can check the DB, PHP and PHP libraries, allows function execution, and is built on the new environmentlib.php script. All the checks are defined in one simple XML file, and a mechanism to update it has been provided (using the new componentlib.class.php library).
  2. (N, C) POSTPONED: datalib.php to support collations under MySQL. This will allow language-specific ordering to be controlled from Moodle, but 1.6.0 won't offer support for it. Instead, MySQL table/field collation can be altered "manually" if the default Unicode collation isn't enough.
  3. (D,A) textlib.class to handle all those UTF-compliant functions. It has been working since 1.5.3. Built as a wrapper around the Typo3 text handling libraries, it offers conversion between charsets and a bunch of functions such as substr, strtoupper, strpos...
  4. (D,C) XML import/export (scorm, ims, backup/restore, glossary, quizzes...). Under UTF-8 mode, both utf8_encode() and utf8_decode() won't be needed anymore.
  5. (D,A) Excel Export: A new excellib.class.php class wrapper has been built. It works with some PEAR libraries to be able to create UTF-16LE Excel files properly.
  6. (D,A) Modify every Excel generator to use the new library. This includes: grade/lib.php, course/grades.php, choice/report.php, hotpot/report/default.php, quiz/report/analysis.php, quiz/report/overview.php, quiz/report/ and survey/download.php
  7. (D,A) Fixed break_up_long_words() to use the textlib functions.
  8. (W,A) Fix the glossary so that it properly finds the pivot (initial letter) under UTF-8 (Skodak has some ideas here about the alphabet for each language). Also, for ideogram-based languages, Moodle HQ is planning to do some work (after 1.6). Bug 6125
  9. (D,A) Modify the rss_title() function to support UTF8 chars.
  10. (W,A) htmlentities() to s() migration everywhere. Bug 6121
  11. (W,A) Change uses of substr, strlen, strpos, strtoupper... to the new textlib class, which offers UTF-8 savvy string manipulation functions. Some other functions like moodle_strtolower() will be modified to use the new text library (done!). They will disappear after 1.6 (do it only in 1.7dev). Bug 6122
  12. (D,A) Modify documentation to let users know how they MUST create their DB before installing Moodle 1.6, explaining all the benefits of being UTF-8 enabled from the beginning.
  13. (D,A) Modify the central installation script to:
    • Check DB encoding, warning if unicode hasn't been detected.
    • Execute the environmental checks.
  14. (D,A) Modify the Windows32 Complete Package installation script to:
    • Force DB creation under UTF-8.
    • Execute the environmental checks.
  15. (D,A) The wiki module: the problem was due to the use of htmlentities() without specifying the character set. DFWiki apparently does not have any known problems.
  16. (D,A) UTF-8 national chars could not be used in paths/foldernames. Unicode characters in filenames can now be enabled by setting $CFG->unicodecleanfilename=true in config.php, though this option is not recommended. Please note that Unicode characters in filenames may be broken during the zip/unzip process; native info-zip binaries do not work at all on Windows, so please use internal zipping/unzipping.
  17. (D,A) GD support for UTF-8 strings. Perhaps it'll require some hacking + new fonts to be added (centrally or inside each lang pack). Basically, graphlib already supported UTF-8 fonts without any problems. We've updated the central TrueType font, which now supports Latin, Cyrillic and Greek. Some old, unneeded fonts have been deleted. Languages not covered by it will need to install their own fonts (under moodledata/lang/xx_utf8/fonts).
  18. (D,C) RSS block. Working fine now with the new textlib.class.php library.
  19. (D,A) Language list issues: lang names not showing properly. No action required because all those names are now UTF-8 and should display properly. I think it's a browser/font issue.
  20. (D,C) The Assignment module reports 0 words for languages without word separators. A new configuration option has been added to count letters instead of words.
  21. (D,C) Modify the email_to_user() function to enable encoding of mails based on site setting and/or user preference. This is a must because a lot of mail clients/widgets don't support UTF-8 encodings. Done (email encoding can now be specified at site and user level, plus support for any charset (header + body)).
  22. (N,C) Look for all the occurrences of "en" or 'en' in source code and change them to 'en_utf8' if necessary. Bug 6124
  23. (D,A) New language editing interface for utf_8 lang packs. It should support dataroot lang packs, use 'en_utf8' as master language and "lock" the new langconfig.php file.
  24. (D,C) DB migration with these features:
    • Convert all the users/courses/site languages to their new alternative.
    • Recover from crash smoothly.
    • Handle central tables and official modules. Contrib modules should implement their own script. See UTF-8 contrib for more info.
  25. (N,C) Modify the document_file() function to work properly (links to If documentation isn't going to be included with Moodle anymore, such function will disappear completely in a near future.
  26. (D,A) Modify the footer output to allow it to go to the proper (en, es...) wiki page in A list of available languages can be harcoded, defaulting to en.
  27. (D,C) Review the RSS feeds creator to detect if conversion to UTF-8 is needed.
  28. (D,C) Analyse if we need the $SESSION->encoding. Removed from main CVS; 3rd party plugins must be updated anyway to use current_charset() and $CFG->unicodedb to be fully utf8 and 1.6 compatible.
  29. (D,A) Delete all the get/print_string('thisencoding') calls and change them to the new current_charset() function. Seems finished.
  30. (D,C) Analyse and, if possible, implement a slightly improved moodle_setlocale() function because of differences between Unix locales and Win32 locales. The locale names should go into a new string inside each langconfig.php file and, after OS detection, the correct one is used. Done, 95% of Windows langs will work.
  31. (D,C) Implement one PEAR download utility if finally we cannot add it to standard distro (it's ready since some time ago under Needed to generate Excel files and, potentially, it'll grow. Finally we got perms from Xavier Noguer (see lib/pear/README.txt for a note about the PHP license and Moodle) and it has been included. Anyway, the pear download continues being generated in daily.
  32. (D,C) Handling of passwords. After DB conversion, password hashes are updated to utf8 during the next user login. First the hash of the Unicode plain text password is tried; then the plain text is converted into the 'oldcharset' defined in the new language pack and its hash is checked again.
  33. (D,A) Modify (and potentially upgrade) the MyPHPAdmin module in order to recognise new lang pack names. Done: only support for utf8 langs has been added.
  34. (N,C) POSTPONED: Add one new parameter (xxx_original_encoding) to authentication methods in order to be able to convert from external sources (ldap, db...) encoding to utf8 if Moodle is running in that mode avoiding the current utf8_decode() implemented. Bug 6123
  35. (D,A) Modify the install.php script to be able to detect DB encoding and warn about it plus use the new environment stuff to perform tests.
  36. (D,A) Create a collection of installer.php files (named so to avoid conflicts with the old install.php), stored under the install/lang directory and used exclusively in the installation process. Hack get_string() to support these files ONLY during installation, and make a script that builds them daily from contents existing in other language files (i.e., no manual handling of them!). Also, perhaps, add the possibility of language download at the end of the installation script.
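
The two-step password check described in item 32 could be sketched as follows. This is a hedged illustration only: the use of md5(), the function name sketch_password_matches(), its return values and the converter callback are assumptions made for the example, not the code used by Moodle.

```php
<?php
// Hypothetical sketch of the login-time check from item 32.
// $plain      - password as typed by the user (UTF-8)
// $storedhash - hash currently stored in the DB
// $oldcharset - the 'oldcharset' defined in the language pack
// $convert    - callable($text, $charset) returning the text in $charset, or false on failure
function sketch_password_matches($plain, $storedhash, $oldcharset, $convert) {
    // Step 1: try the hash of the Unicode plain text password.
    if (md5($plain) === $storedhash) {
        return 'utf8';
    }
    // Step 2: convert the plain text to the old charset and check its hash again.
    $legacy = $convert($plain, $oldcharset);
    if ($legacy !== false && md5($legacy) === $storedhash) {
        return 'legacy'; // the caller would now store md5($plain) so the hash becomes utf8
    }
    return false;
}

// One possible converter, assuming the iconv extension is available.
$convert = function ($text, $charset) {
    return @iconv('UTF-8', $charset, $text);
};
```

Passing the converter in as a callback keeps the sketch independent of any particular conversion library; a real implementation would more likely go through the textlib conversion functions mentioned in item 3.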

Patches reference

Post here all the solutions you know if you consider that they'll be interesting to solve some of the pending items in the list above.

Patches that allow Japanese-language Moodles to function without garbling have been prepared and are available at the following sites.

These patches are explained here

Mr. Kashiwagi's "Supertak" Patch (described in the thread above)

Prof Kita's patches and rpms (based in part on Mr. Kashiwagi's)
A read me file in English describing the rpm

A patch describing all the things that need to be done

Many, or even most end-users (including myself = Tim) are not sufficiently confident making extensive patches, so our Moodles have been garbling in important areas (email/excel).

One interesting PHP-UTF8 reference:

Great reference for Unicode Fonts: and

For item #8: may be for letter-index use "distinct" first letters from existing data, and not lang alphabet? it can help in multilanguage glossary... --Ne Nashev 13:53, 29 March 2006 (WST)