
Development:Languages

From MoodleDocs
Revision as of 18:11, 12 February 2010 by David Mudrak

Note: This article is a work in progress. Please use the page comments or an appropriate moodle.org forum for any recommendations/suggestions for improvement.


Languages subsystem improvements
Project state: Working on spec
Tracker issues: MDL-18797, MDL-15252
Discussion: [1]
Assignee: David Mudrak

Moodle 2.0


This is a proposal and specification of changes to language string processing in Moodle 2.0 and 2.1. A good starting point is the mindmap giving an overview of the proposed changes.

Current issues

Why the change is needed:

String files are not branched
We must keep all strings from all branches in place for backwards compatibility, so we cannot easily clean up language packs. Some say that branching and merging is too big a task for our translators (see MDL-15252).
Plural forms, gender forms and other grammar
We are unable to handle plurals at all. Handling plural forms the way gettext does is a traditional, well-tested and robust approach (see MDL-4790). MDL-12433 by Sam Marshall shows an alternative approach based on logical expressions.
Strings can't be modified
It is difficult to notify translators that a string was modified (expanded, fixed, changed) - as in this case, for example. The current workaround is the policy of adding another string with the same name plus a suffix (like 'license2'). It would be nice if such strings were tagged/highlighted in the translation UI.
We do not use standard formats
Translators can't use specialized tools for translation (PO/gettext editors, community translation portals). Also, I (David) am not aware of any benchmark comparing the performance of our native $string[] format with, for example, the standard .po format.
More syntax checks are required
So that translators do not break Moodle functionality (see MDL-12433).
Language packs are PHP code, but stored in moodledata
This increases the severity of some security exploits: any exploit that lets you write files to an arbitrary location in moodledata suddenly lets you execute arbitrary PHP code on the server. On the other hand, it would be nice to be able to allow complex logic when evaluating dynamic strings (i.e. those containing the $a param/params).
Right-to-left languages
There are problems reported in RTL languages when using online tools (including our current one), which lead to placeholders like a$ and a$->lastname being put into the string definition.

Goals

  1. Fix all the issues listed above
  2. Do not reinvent the wheel
  3. Keep to the "do one thing and do it well" principle
  4. Keep it simple and stupid
  5. Keep the translation process as simple as possible - translators are not geeks
  6. Make simple things easy and hard things possible
  7. Make this available for Moodle 2.0

Key design questions

What is the data structure for storing the master copies of the lang packs that translators work on?
At the moment it is a plain PHP array, editable via the translation UI or directly. Petr proposes keeping these strings in a database that can be synced with some central storage. Whatever the format is, we must be able to store some metadata: the timestamp of the last modification, the author name, proposed alternatives, comments etc. (see the Rosetta translation tool at Launchpad for an example of possible metadata).
What is the UI for translators, what are the processes of contributing and how are the translations redistributed to Moodle sites?
Our translators should not be forced to use one single tool. We should consider switching to a standardized common format (like PO or XLIFF) that is supported by a variety of advanced tools (equipped with translation memory, connected with dictionaries, i18n portals etc.).
What is the data structure Moodle uses at runtime?
This is just a performance optimization (an implementation detail). It should be independent of the native format that humans work with, so that it can be modified at any time in the future. For example, see the system proposed by Tim based on calling class methods (inspired by Perl's Maketext).
What is the format of a lang string, and how are placeholders substituted?
It is strongly tied to the runtime format, so it can be changed at any time. On the other hand, both the UI and the storage format must support it.

Miscellaneous suggestions

  • Store downloaded lang packs in a new location $CFG->langpacks, which defaults to $CFG->dataroot/lang. Paranoid admins can change this to a different location that is normally read-only for the web server, and manually switch it to read-write when they are performing an upgrade, editing strings, or installing a lang pack. The UI should therefore check whether $CFG->langpacks is writable before starting any of these operations, and explain the situation to admins if it is not.
  • Use a kind of template syntax so translators can replace static strings with a template. The syntax can be similar to what Smarty and other templating engines use. So far we should be fine with a basic set of {if} {else}, eq, gt, lt and some math operators (including modulo). Such templates would be compiled into proper PHP code once, during lang pack compilation. Strict rules would apply, which should reduce the risk of executing malicious code.
  • Or have a possibility to use a PHP function to actually return the translated string instead of a static string definition.
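The writability check suggested in the first item above could look roughly like the following. This is a minimal sketch, not an existing Moodle function: the helper name langpacks_write_problem() is hypothetical, and the proposed $CFG->langpacks value is passed in as a plain path.

```php
<?php
// Hypothetical helper, not part of Moodle. The directory would come from the
// proposed $CFG->langpacks setting; here it is just a function argument.

/**
 * Returns null when lang packs can be written to $langpacksdir, or a
 * human-readable explanation for the admin when they cannot.
 */
function langpacks_write_problem(string $langpacksdir): ?string {
    if (!is_dir($langpacksdir)) {
        return "Language pack directory '$langpacksdir' does not exist.";
    }
    if (!is_writable($langpacksdir)) {
        return "Language pack directory '$langpacksdir' is read-only. "
             . "Make it writable for the web server before installing or editing language packs.";
    }
    return null;
}

// Usage: refuse to start the operation and explain why.
$problem = langpacks_write_problem(sys_get_temp_dir());
echo $problem === null ? 'OK to proceed' : $problem, PHP_EOL;
```

Returning a message instead of a boolean lets the upgrade, lang editing and lang pack installation screens all show the same explanation.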

Use cases

  1. Developers add new strings to the code and commit their work into CVS (core or contrib)
  2. Developers can add a comment to the string, e.g. "this string is used for ..."
  3. Developers add a new string and link it with an existing one, with an explanation, e.g. "this string replaces ...". Such links are available to translators and help them to decide on the correct translation.
  4. Translators translate strings on several Moodle branches and do not need to worry about branching, committing and merging
  5. Translators can see the list of untranslated strings and translate them
  6. Translators can see the list of outdated translated strings (i.e. the English string was changed) and update the translation
  7. Admins can locally modify the language pack for their site
  8. Community members can propose alternative translations. They are reviewed by the lang pack maintainer and may be approved.

Research

This is the list of projects, resources and tools that were explored before writing this spec

  • Great CPAN article about software localization. A plain string-based lexicon is not enough. Strings can be translated by functions only. "A phrase is a function; a phrasebook is a bunch of functions."
  • XLIFF - XML Localization Interchange File Format
  • Virtaal - promising, we could have XLIFF <-> .php conversion
  • Launchpad - translation portal used by Ubuntu and many other projects. Would require BSD licensing, therefore IMO not suitable as we could not import our current GPL'ed translation. Seems to be quite slow during the process.
  • Plural forms in gettext
  • Zend_Translate reference guide
  • MDL-12433 - Sam Marshall's proposal
  • MediaWiki approach: Grammar forms and plurals: is are (Example of how mediawiki outputs the correct given pluralization form depending on the count. Plural transformations are used for languages like Russian based on "count mod 10").

Functional proposals

Overall strings processing flow

(Follow the attached UML flow diagram)

UML: Overall string processing flow
  • All string definitions are kept in a central repository in some storage format which supports branching. Officially maintained language packs are referred to as master in this proposal. Every language pack can have its parent defined. The English language pack can be seen as the greatest common parent of all language packs.
  • During upgrade or on demand, the relevant branch of the master language packs is fetched (downloaded) automatically from the central repository. Together with the selected language, all its parents, grandparents, great-grandparents etc. are downloaded, too.
  • Administrators can keep local modifications (customizations) of any master pack. We call them local language packs.
  • Immediately after upgrade (or again, on demand), the string definitions are merged from all available sources. The merge logic evaluates the sources for any given string in an order like: fr_ca_local, fr_ca, fr_local, fr, en_local, en. Strings are merged for performance reasons, so that the search for the string to use (local, parent, master, English etc.) is done just once and we do not need to load all possible sources at runtime. After the merge, we have a single place to look for the string definition for every installed language.
  • Together with the merge, strings are compiled into a runtime format that may be optimised in the future. Humans do not modify the compiled format. Strings must be re-merged and re-compiled after any update of master or local packs. During the compilation, syntax checks are performed.
  • The runtime format we will start with will be very similar to the current one. Strings are defined as array elements indexed by the string identifier. The arrays are defined in separate files for every module. We can, however, modify this in the future. For example, we can divide string definitions into files not by the module name but by the real usage frequency. Strings that are used very often (e.g. on every page) would go into a common file which can be loaded during bootstrap. This would reduce memory usage and the number of I/O operations.
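The merge step can be sketched as follows. This is only an illustration under assumptions: the function name merge_lang_packs() and the in-memory layout (pack name => array of strings) are made up for the example, and the sample strings are not taken from real language packs.

```php
<?php
// Sketch only: the function name and the data layout are assumptions.

/**
 * Merges string definitions from several packs into one array.
 *
 * @param array $packs    pack name => (string identifier => text)
 * @param array $priority pack names, most specific first
 * @return array string identifier => text from the first pack defining it
 */
function merge_lang_packs(array $packs, array $priority): array {
    $merged = [];
    foreach ($priority as $packname) {
        // The + operator keeps keys that are already set, so more specific
        // packs win; packs that are not installed are simply skipped.
        $merged += $packs[$packname] ?? [];
    }
    return $merged;
}

$packs = [
    'en'          => ['yes' => 'Yes', 'no' => 'No', 'course' => 'Course'],
    'fr'          => ['yes' => 'Oui', 'no' => 'Non'],
    'fr_ca_local' => ['yes' => 'Ouais'],
];
$order = ['fr_ca_local', 'fr_ca', 'fr_local', 'fr', 'en_local', 'en'];

// 'yes' comes from fr_ca_local, 'no' from fr and 'course' from en.
$strings = merge_lang_packs($packs, $order);
```

Because the result is a single flat array per language, the runtime lookup never has to walk the parent chain again.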

Drop the utf8 suffix from the language codes

Languages will be referred to as "en", "cs", "en_us" or "fr_local" etc. Directories will be renamed.
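As an illustration of the renaming, a migration script could map old codes to new ones like this. The helper name strip_utf8_suffix() is hypothetical and not an existing Moodle function.

```php
<?php
// Hypothetical helper for the rename; not an existing Moodle function.
function strip_utf8_suffix(string $langcode): string {
    // Remove a trailing "_utf8" only; everything else is left untouched.
    return preg_replace('/_utf8\z/', '', $langcode);
}

foreach (['en_utf8', 'cs_utf8', 'en_us_utf8', 'fr'] as $old) {
    echo $old, ' -> ', strip_utf8_suffix($old), PHP_EOL;
}
```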

Locations of lang files

All core plugins will have their language files in their own scope, as the contrib plugins have. So, for example, workshop strings will be defined in '/mod/workshop/lang/en/workshop.php' instead of the legacy '/lang/en_utf8/workshop.php'.

HTML help files replaced with ordinary strings

Help should consist of only a paragraph or two of static HTML; the rest goes to the wiki. We will drop help file indices (the index of all help files) in favour of the wiki. This will make translation easier, as we will not need other tools to translate help files.

To decide: help strings can be stored either in a separate file, e.g. /mod/workshop/lang/en/workshop_help.php, or together with the other workshop strings, e.g. $string['intro'] = 'Workshop intro'; $string['intro_help'] = 'This is the ...';. My +1 for the latter form, as it keeps the strings and their help together, which helps translators keep the translation consistent.

Other changes

  • The only valid placeholder in the runtime format is {$a} for strings and numbers and {$a->foobar} for objects (see MDL-18841).
  • Around 90% of our strings do not contain any placeholder and they will be immediately returned by get_string().
  • If the string contains one or more placeholders, they are replaced with their eval()-uated result. We can safely eval() the whole string definition because the string compiler makes sure that the placeholders are the only executable/evaluable code. All other malicious code and $variables are properly quoted/escaped/converted to HTML entities.
  • If the string is defined as NULL, the corresponding function defined in the language pack library is called, passing $a as a parameter. So get_string('foo', 'bar', $a); would return the value returned by e.g. lang_cs_bar_foo($a) if the current language is Czech. Power translators may use such functions to properly handle plural forms and other grammar aspects.
  • A valid format for string identifiers must be defined. These identifiers may be used as associative array keys (as in $string['identifier']), function names, HTML form field names, file names etc. No characters like '*' can be used (as happened in MDL-21375).
  • Some string identifiers are not hardcoded (explicitly referenced) in the code but computed, e.g. get_string('level' . $level, 'arcade') with strings 'level1', 'level2', 'level3' etc. being defined. It is difficult to automatically search for the usage of such strings. Therefore, I (David) propose the following guideline (rule):
    String identifiers must follow our convention for naming $variables: a single lowercase word, no underscores. The colon (:) sign can be used only in strings with the names of capabilities and nowhere else. The minus sign (-) can be used only in string identifiers which are partially computed, as in get_string('region-' . $regionid, 'theme_example'), so it is a sign that we should be careful when trying to find a usage of such a string in the code.
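The lookup rules above (plain strings returned as-is, placeholders substituted, NULL definitions delegated to a function) can be sketched as below. Everything here is an assumption made for illustration: demo_get_string() and substitute_placeholders() are hypothetical names, the "[[id]]" marker for a missing string is just a placeholder choice, and the placeholders are filled in with a regular expression rather than with the eval() approach described above.

```php
<?php
// Sketch under assumptions: hypothetical names, and regex substitution is
// used here instead of eval() on a compiled string.

/** Replaces {$a} and {$a->property} in $text with values taken from $a. */
function substitute_placeholders(string $text, $a): string {
    return preg_replace_callback(
        '/\{\$a(?:->(\w+))?\}/',
        function (array $m) use ($a): string {
            $prop = $m[1] ?? '';
            if ($prop === '') {
                // Plain {$a}: only strings and numbers are valid here.
                return is_scalar($a) ? (string) $a : $m[0];
            }
            // {$a->property}: look the property up; leave unknown ones as-is.
            $props = (array) $a;
            return isset($props[$prop]) ? (string) $props[$prop] : $m[0];
        },
        $text
    );
}

/** Example language pack library function: Czech plural forms of "file". */
function lang_cs_bar_foo($a): string {
    $n = (int) $a;
    if ($n === 1) {
        return "$n soubor";
    }
    return ($n >= 2 && $n <= 4) ? "$n soubory" : "$n souborů";
}

function demo_get_string(array $strings, string $lang, string $module, string $id, $a = null): string {
    if (!array_key_exists($id, $strings)) {
        return '[[' . $id . ']]';
    }
    $text = $strings[$id];
    if ($text === null) {
        // NULL definition: delegate to lang_{lang}_{module}_{identifier}().
        $function = "lang_{$lang}_{$module}_{$id}";
        return function_exists($function) ? $function($a) : '[[' . $id . ']]';
    }
    // Most strings contain no placeholder and are returned immediately.
    return strpos($text, '{$a') === false ? $text : substitute_placeholders($text, $a);
}

$strings = ['hello' => 'Hello {$a->firstname}!', 'added' => 'Added {$a} items',
            'plain' => 'Save changes', 'foo' => null];
$user = new stdClass();
$user->firstname = 'Eva';
echo demo_get_string($strings, 'cs', 'bar', 'hello', $user), PHP_EOL;
echo demo_get_string($strings, 'cs', 'bar', 'foo', 3), PHP_EOL;
```

A per-string function is the only place where language-specific logic such as the Czech 1 / 2-4 / other split lives, so static strings stay simple for most translators.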

Central web-based translation tool

See Development:Languages/AMOS

Implementation proposals

Petr's proposal to store strings in one central database and to disable direct commits.

--David Mudrak 22:43, 23 November 2009 (UTC): I disagree with the "no change meaning" rule. IMO if we have a system to track changes and, mainly, to inform translators that their translation is outdated, we can fix/update/extend English strings as needed. Together with branching, this will lead to nice "reduced" packs without redundancy. Also note we must find a way to combine this approach with the grammar issues (plural forms etc.) that will probably have to be solved as proper PHP functions/class methods...

File format translators work with

Translators will be encouraged to use the central web-based tool only. There will be a way to provide the translated work in the current Moodle strings format (i.e. the $string[] array).

Translation tools and the process

See MDL-15252 (Cleanup of English language pack) and the discussion at http://moodle.org/mod/forum/discuss.php?d=118707 for Koen's proposition. The branching issue, the translation process and other aspects are discussed there.

From Martin in Dev chat: if you want crazy ideas, how about get_string returns some special tags and those tags get converted to ajax on the GUI so that translators can translate directly in the main Moodle GUI?

What a cool idea. Could be a special mode you have to turn on in the admin screens. Perhaps even if you turned this mode on, it would still only be active for people with certain roles, or perhaps when it was turned on, it would have to apply to all roles, so that you could edit strings for not-logged-in users. Anyway, when this mode was on, it would:

  1. Add <span class="moodle-lang-string" id="lang_string|admin|langedit">Language editing</span> around each string on the page - to use one example.
  2. Use $PAGE->requires->js to load an extra JS file that adds an on-click handler to all such spans, so that clicking on one pops up the language editing UI in a YUI dialogue.
David Mudrak 14:07, 23 November 2009 (UTC): the solution based on wrapping <span> around every string was already considered and dropped. It may badly break XHTML as the string itself may appear as a value of an HTML tag's attribute: <img title="<span class="moodle-lang-string" .... We cannot tell in which context the string will appear.
David's counter-proposal: get_string() could track all strings used on the current page, and the AJAX form to edit them all could be rendered before the footer(). Or an 'Edit system text on this page' link could appear there.
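A rough sketch of how such tracking might work is shown below. The class string_usage_tracker and the wrapper tracked_get_string() are hypothetical names invented for this example, and the "[[id]]" marker for a missing string is only a placeholder choice.

```php
<?php
// Hypothetical sketch: the class and function names are not Moodle APIs.
final class string_usage_tracker {
    /** @var array "module/identifier" => true for every string requested so far */
    private static $used = [];

    public static function record(string $identifier, string $module): void {
        self::$used["$module/$identifier"] = true;
    }

    /** Returns the distinct strings used on this page, in order of first use. */
    public static function used_strings(): array {
        return array_keys(self::$used);
    }
}

// A get_string()-like wrapper records every lookup before returning the text.
function tracked_get_string(array $strings, string $identifier, string $module): string {
    string_usage_tracker::record($identifier, $module);
    return $strings[$module][$identifier] ?? '[[' . $identifier . ']]';
}

$strings = ['admin' => ['langedit' => 'Language editing'], 'moodle' => ['yes' => 'Yes']];
tracked_get_string($strings, 'langedit', 'admin');
tracked_get_string($strings, 'yes', 'moodle');
tracked_get_string($strings, 'yes', 'moodle'); // repeated use is recorded once

// Before the footer, the page could list the strings to offer for editing.
foreach (string_usage_tracker::used_strings() as $key) {
    echo $key, PHP_EOL;
}
```

Since the strings are returned unchanged, this avoids the XHTML breakage of the <span> approach; the cost is that the editing form is detached from the place where each string appears.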

Runtime file format

Getting the history of strings into database

Source code management systems can't show us how a given string evolved over time. Translators, on the other hand, need to know the "personal history" of any given string, with the author, date, comment etc. of each change. The string timeline will be populated from the DB. For this to work, we will need to load the whole history of strings into the DB. That can be done using our git mirror. This conversion is done:

  1. Regularly for the master English strings that are part of the Moodle source code. David already has a prototype of a script that tracks commits to language files and updates the database.
  2. Once for all other languages. When we have the new translation portal prepared, we will stop supporting direct CVS commits into languages. The history of all commits into all lang packs will be transferred into the database and further translations will be recorded there.

David is writing these migration scripts as a part of his Development:Languages/AMOS tool.

See also