Languages subsystem improvements 2.0

Latest revision as of 14:54, 11 August 2021

Languages subsystem improvements
Project state	Complete
Tracker issue	MDL-18797 MDL-15252
Discussion	[1]
Assignee	David Mudrak

Moodle 2.0

Warning: This page was once used as an initial specification of the project. Only parts of it were finally implemented. It is kept for archive purposes only.

This is a specification of changes to the language strings processing in Moodle 2.0 and 2.1. A good starting point is the mindmap giving an overview of the proposed changes.

Current issues

Why the changes in the current tools and process are needed:

String files are not branched: We must keep all strings from all branches in place for backwards compatibility and we are unable to easily clean up language packs. Some say the branching and merging is too much work for our translators (see MDL-15252).
Plural forms, gender forms and other grammar: We are unable to handle plurals at all. For example, handling plural forms with gettext is a traditional, well-tested and robust approach (see MDL-4790). MDL-12433 by Sam Marshall shows an alternative approach based on logical expressions.
Strings can't be modified: It is difficult to notify translators that a string was modified (expanded, fixed, changed) - as in this case, for example. The current workaround is the policy of adding another string with the same name plus a suffix (like 'license2'). It would be nice if such strings were tagged/highlighted in the translation UI.
We do not use standard formats: Translators can't use specialized tools for translation (PO/gettext editors, community translation portals). Also, I (David) am not aware of any benchmarking showing the performance differences between our native $string[] format and, for example, the standard .po format.
More syntax checks are required: So that translators do not break Moodle functionality (see MDL-12433).
Language packs are PHP code, but stored in moodledata: This increases the severity of some security exploits. It means that any exploit that lets you write files to an arbitrary location in moodledata suddenly lets you execute arbitrary PHP code on the server. On the other hand, it would be nice to be able to allow complex logic when evaluating dynamic strings (i.e. those containing the $a parameter or parameters).
Right-to-left languages: Problems are reported in RTL languages when using online tools (including our current one), which lead to placeholders like a$ and a$->lastname being put into the string definition.

Goals

Fix all the issues listed above
Do not reinvent the wheel
Keep to the "do one thing and do it well" principle
Keep it simple and stupid
Make the translation process as simple as possible - translators are not geeks
Make simple things easy and hard things possible
Make most of this available for Moodle 2.0

Key design questions

When working on the specification, these were the key questions to keep in mind:

What is the data structure for storing the master copies of the lang packs: At the moment, strings are defined in plain PHP associative arrays, editable via the translation UI or directly. These arrays are stored in PHP files in Moodle CVS. We are going to change this. According to Petr Skoda's original proposal, the primary place for keeping all translated strings will be a central database on one of the moodle.org servers. Together with the translation itself, useful meta-data are kept in the database: the timestamp of the last modification, links to the revision of the English string that the translation is based on, the author name, proposed alternatives, comments etc. (see the Rosetta translation tool at Launchpad for an example of possible metadata).
What is the UI for translators, what are the processes of contributing and how are the translations redistributed to Moodle sites: Thanks to storing all strings in a database, we will be able to produce their list in various standardised formats (like PO or XLIFF). Therefore, translators will not be forced to use a single tool but can use their favourite advanced tool with its own translation memory, connected with dictionaries, i18n portals etc. Our central strings repository will support data export to and import from various formats.
What is the data structure that get_string() uses at runtime: This is just a performance optimization (an implementation detail). It should be independent of the native format that humans work with, so that it can be modified at any time in the future. For example, see the system proposed by Tim based on calling class methods (inspired by Perl's Maketext).
What is the format of a lang string, and how are placeholders substituted: This is strongly tied to the runtime format and can be changed at any time. On the other hand, both the UI and the storage format must support it.

Use cases

Developers add new strings to the code and commit their work into CVS (core or contrib)
Developers can add a comment to the string, eg. "this string is used for ..."
Developers add a new string and link it with an existing one with an explanation, e.g. "this string replaces ...". Such links are available to translators and help them decide on the correct translation.
Developers move a string from one component (like enrol.php) into another (like enrol_manual.php). The current translations are re-used.
Developers change the string identifier, for example from configfoobar to foobar_desc.
Translators translate strings on several Moodle branches and do not need to worry about branching, committing and merging
Translators can see the list of untranslated strings and translate them
Translators can see the list of outdated translated strings (i.e. the English string was changed) and update the translation
Admins can locally modify the language pack for their site
Community members can propose alternative translations. They are reviewed by the lang pack maintainer and may be approved.

Research

This is the list of projects, resources and tools that were explored before writing this spec:

Great CPAN article about software localization. Plain string based lexicon is not enough. Strings can be translated by functions only. "A phrase is a function; a phrasebook is a bunch of functions."
XLIFF - XML Localization Interchange File Format
Virtaal - promising, we could have XLIFF <-> .php conversion
Launchpad - translation portal used by Ubuntu and many other projects. Would require BSD licensing, therefore IMO not suitable as we could not import our current GPL'ed translation. Seems to be quite slow during the process.
Plural forms in gettext
Zend_Translate reference guide
MDL-12433 - Sam Marshall's proposal
MediaWiki approach: Grammar forms and plurals:
```
{{plural:1|is|are}} {{plural:2|is|are}}
```
(Example of how MediaWiki outputs the correct plural form depending on the count. Plural transformations are used for languages like Russian based on "count mod 10").
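The "phrase is a function" idea and the "count mod 10" rule mentioned above can be sketched in PHP as follows. This is only an illustration: the function names are hypothetical and not part of Moodle, the plural rule is the one published for Russian in the gettext manual, and the noun forms are example data.

```php
<?php
// Hypothetical helpers for illustration; they are not Moodle functions.

/**
 * Returns the index of the plural form to use for a count in Russian,
 * following the rule from the gettext manual (three forms: 0, 1, 2).
 */
function sketch_plural_index_ru(int $n): int {
    $n = abs($n);
    if ($n % 10 === 1 && $n % 100 !== 11) {
        return 0; // 1, 21, 31, ... but not 11.
    }
    if ($n % 10 >= 2 && $n % 10 <= 4 && ($n % 100 < 10 || $n % 100 >= 20)) {
        return 1; // 2-4, 22-24, ... but not 12-14.
    }
    return 2; // 0, 5-20, 25-30, ...
}

/**
 * A phrase implemented as a function: the count decides the noun form.
 */
function sketch_files_phrase_ru(int $n): string {
    $forms = ['файл', 'файла', 'файлов'];
    return $n . ' ' . $forms[sketch_plural_index_ru($n)];
}
```

A static string with a single {$a} placeholder cannot express this choice, which is why the resources above argue for letting a function produce the phrase.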

Functional proposals

Overall strings processing flow

(Follow the attached UML flow diagram)

UML: Overall string processing flow

All string definitions are kept in a central database together with all other meta-data (the branch, where the string comes from, what its history was etc.). Officially maintained language packs are referred to as master in this proposal. Every language pack can have its parent defined. The English language pack can be seen as the greatest common parent of all language packs.
During upgrade or on demand, the relevant branch of the master language packs is fetched (downloaded) automatically from the central database. Together with the selected language, all its parents, grandparents, great-grandparents etc. are downloaded, too.
Administrators can keep local modifications (customizations) of any master pack. We call them local language packs.
Immediately after upgrade (or again, on demand), the string definitions are merged from all available sources. The merge logic evaluates the sources for any given string in an order such as: fr_ca_local, fr_ca, fr_local, fr, en_local, en. Strings are merged for performance reasons, so that the search for the string to use (local, parent, master, English etc.) is done just once and we do not need to load all possible sources at runtime. After the merge, we have a single place to look for the string definition for every installed language. By default, the location for these merged strings will be $CFG->dataroot/cache/lang/XX/component.php where XX is the language code (fr_ca, fr, en in this example). The benefit is that get_string() can rely on always having the string defined in this file; no other seeking and I/O is needed at runtime.
Together with the merge, strings are compiled into a runtime format that may be optimised in the future. Humans do not modify the compiled format. Strings must be re-merged and re-compiled after any update of master or local packs. During the compilation, syntax checks are performed.
The runtime format we will start with will be very similar to the current one. Strings are defined as array elements indexed by the string identifier. The arrays are defined in separate files for every module. We can, however, modify this in the future. For example, we can divide string definitions into files not by the module name but by the real usage frequency. Strings that are used very often (for example on every page) would go into a common file which can be loaded during bootstrap. This would reduce memory usage and the number of I/O operations.
There will be a way to let get_string() call a PHP function/method that returns the translated string, instead of using a static string definition. This will allow advanced translators to deal with many grammar issues they have in their languages (notably singular vs plural forms).
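The merge step described in this list can be sketched as below. It is a minimal illustration under stated assumptions, not the implemented algorithm: both function names are hypothetical, the parent relationships are passed in as a plain array, and every source is assumed to be already loaded as an identifier => text array.

```php
<?php
// Hypothetical sketch; none of these functions exist in Moodle.

/**
 * Builds the lookup order for a language. For fr_ca with parent fr this gives
 * fr_ca_local, fr_ca, fr_local, fr, en_local, en.
 *
 * @param string $lang    language code to start from
 * @param array  $parents map of language code => parent language code
 */
function sketch_source_order(string $lang, array $parents): array {
    $order = [];
    $seen = [];
    // Walk up the parent chain; $seen guards against circular parent definitions.
    while ($lang !== 'en' && !isset($seen[$lang])) {
        $seen[$lang] = true;
        $order[] = $lang . '_local';
        $order[] = $lang;
        $lang = $parents[$lang] ?? 'en';
    }
    // English is the common parent of all language packs.
    $order[] = 'en_local';
    $order[] = 'en';
    return $order;
}

/**
 * Merges string arrays so that the first source in $order defining an identifier wins.
 *
 * @param array $order   source names, highest priority first
 * @param array $sources map of source name => (identifier => text)
 */
function sketch_merge_strings(array $order, array $sources): array {
    $merged = [];
    foreach ($order as $source) {
        // The array union operator keeps existing keys, so earlier sources take precedence.
        $merged += $sources[$source] ?? [];
    }
    return $merged;
}
```

The merged array is what would then be compiled and written to the per-language cache file, so that the lookup at runtime touches a single source.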

Naming and locations

Languages will be referred to as "en", "cs", "en_us" or "fr_local" etc. Directories will be renamed.
Downloaded lang packs are stored in $CFG->langpacks, which is by default $CFG->dataroot/lang. Paranoid admins can change this to a different location that is normally read-only for the web server. They will then manually switch it to read-write when they are performing an upgrade, editing language strings, or installing a lang pack. The UI should therefore check whether $CFG->langpacks is writable before starting any of these operations, and explain the situation to admins if it is not.
After merging and compiling, the strings in runtime format are stored in $CFG->dataroot/cache/lang/ (see above).
All core plugins will have their language files in their own scope as the contrib plugins have. So for example workshop strings will be defined in '/mod/workshop/lang/en/workshop.php' instead of legacy '/lang/en_utf8/workshop.php'
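The locations and the writability check described above could look roughly like this. It is a sketch only: both helpers are hypothetical, the directories are passed in as arguments instead of being read from $CFG, and the wording of the messages is invented for the example.

```php
<?php
// Hypothetical helpers for illustration; neither function exists in Moodle.

/**
 * Builds the location of a merged, compiled string file,
 * following the pattern dataroot/cache/lang/XX/component.php.
 */
function sketch_compiled_path(string $dataroot, string $lang, string $component): string {
    return rtrim($dataroot, '/') . "/cache/lang/{$lang}/{$component}.php";
}

/**
 * Returns null when language packs can be written to $langpacksdir,
 * or a message that explains the situation to the admin.
 */
function sketch_langpacks_problem(string $langpacksdir): ?string {
    if (!is_dir($langpacksdir)) {
        return "The language pack directory '{$langpacksdir}' does not exist.";
    }
    if (!is_writable($langpacksdir)) {
        return "The language pack directory '{$langpacksdir}' is read-only. "
            . 'Make it writable before upgrading, editing strings or installing a language pack.';
    }
    return null;
}
```

An upgrade or lang editing page would call the second helper first and show the returned message instead of starting the operation.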

HTML help files replaced with ordinary strings

Help should consist of only a paragraph or two of static HTML; the rest goes to the wiki. We will drop the help file indices (the index of all help files) as well as other dynamically generated help in favour of the wiki. This will make translation easier, as we will not need other tools to translate help files.

Help strings are kept together with the other strings of the component, for example in the workshop module:

$string['intro'] = 'Workshop intro'; $string['intro_help'] = 'This is the ...';

This keeps each string and its help text together, which helps translators keep the translation consistent.

See Help strings for further information.

Other changes

The only valid placeholder in runtime format is {$a} for strings and numbers and {$a->foobar} for objects (MDL-18841).
Around 90% of our strings do not contain any placeholder and they will be immediately returned by get_string().
If the string contains one or more placeholders, they are replaced with their eval()-uated result. We can safely eval() the whole string definition because the string compiler makes sure that the placeholders are the only executable/evaluable code. Any other code, including malicious code and $variables, is properly quoted, escaped or converted to HTML entities.
A valid format for string identifiers must be defined. These identifiers may be used as associative array keys (as in $string['identifier']), function names, HTML form field names, file names etc. No characters like '*' can be used (as happened in MDL-21375).
Some string identifiers are not hardcoded (explicitly referenced) in the code but computed, e.g. get_string('level' . $level, 'arcade') with strings 'level1', 'level2', 'level3' etc. being defined. It is difficult to automatically search for the usage of such strings. Therefore, I (David) propose the following guideline (rule):
String identifiers must follow our convention for naming $variables: single lowercase word, no underscore. The colon (:) sign can be used only in strings with the names of capabilities and nowhere else. The minus sign (-) can be used only in string identifiers which are partially computed, as in get_string('region-' . $regionid, 'theme_example'), so that it is a sign that we should be careful when trying to find a usage of such a string in the code. The underscore is reserved for the special _help and _desc suffixes.
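The placeholder and identifier rules in this list can be illustrated with the sketch below. Both functions are hypothetical and not part of Moodle. The substitution deliberately uses a regular expression with a callback instead of the eval()-based mechanism described above, and the identifier pattern is only one possible reading of the proposed guideline.

```php
<?php
// Hypothetical sketch; this is not Moodle's get_string() implementation.

/**
 * Replaces {$a} and {$a->property} in a string definition.
 * Placeholders that cannot be resolved are left untouched.
 *
 * @param string $definition string definition in runtime format
 * @param mixed  $a          string, number, object or associative array
 */
function sketch_substitute(string $definition, $a = null): string {
    return preg_replace_callback(
        '/\{\$a(?:->(\w+))?\}/',
        function (array $m) use ($a): string {
            if (!isset($m[1])) {
                // Plain {$a}: only meaningful for strings and numbers.
                return is_scalar($a) ? (string)$a : $m[0];
            }
            // {$a->property}: accept objects (public properties) and arrays.
            $props = is_object($a) ? get_object_vars($a) : (array)$a;
            return array_key_exists($m[1], $props) ? (string)$props[$m[1]] : $m[0];
        },
        $definition
    );
}

/**
 * Checks an identifier against one reading of the guideline: a lowercase word,
 * optional parts joined by ':' or '-', and an optional _help or _desc suffix.
 */
function sketch_is_valid_identifier(string $id): bool {
    return preg_match('/^[a-z][a-z0-9]*(?:[:-][a-z0-9]+)*(?:_(?:help|desc))?$/D', $id) === 1;
}
```

Note that this pattern does not check that a colon appears only in capability names or that a minus appears only in computed identifiers; those rules depend on how the string is used and would need a separate check.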

Central web-based translation tool

See Automated_Manipulation_of_Strings_2.0

Implementation proposals

Read Petr's proposal to store strings in one central database and to disable direct commits. It is roughly valid, except that the "no change of meaning" rule is not enforced. AMOS is able to track changes and inform translators that their translation is outdated, so we will be able to fix, update or extend English strings as needed. Together with branching, this will lead to nicely reduced packs without redundancy.

Translation tools and the process

See MDL-15252 (Cleanup of English language pack) and the discussion at http://moodle.org/mod/forum/discuss.php?d=118707 for Koen's proposition. The branching issue, the translation process and other aspects are discussed there.

Some notes:

From Martin in Dev chat: if you want crazy ideas, how about get_string returns some special tags and those tags get converted to ajax on the GUI so that translators can translate directly in the main Moodle GUI?

What a cool idea. Could be a special mode you have to turn on in the admin screens. Perhaps even if you turned this mode on, it would still only be active for people with certain roles, or perhaps when it was turned on, it would have to apply to all roles, so that you could edit strings for not-logged-in users. Anyway, when this mode was on, it would:

Add <span class="moodle-lang-string" id="lang_string|admin|langedit">Language editing</span> around each string on the page - to use one example.
Use $PAGE->requires->js to include an extra JS file that adds an on-click handler to all such spans, so that clicking on a span pops up the language editing UI in a YUI dialogue.

David Mudrak 14:07, 23 November 2009 (UTC): the solution based on wrapping <span> around every string was already considered and dropped. It may badly break XHTML, as the string itself may appear as the value of an HTML tag's attribute:

<img title="<span class="moodle-lang-string" ...

We are unable to tell the context in which the string will appear.

David's counter-proposal: get_string() could track all strings used on the current page, and the AJAX form to edit them all could be rendered before the footer(). Alternatively, an 'Edit system text on this page' link could appear there.
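The tracking idea could be sketched as follows. This is an assumption-laden illustration rather than a design: the class and the get_string() stand-in are hypothetical, and the strings are passed in as a plain nested array instead of being loaded from language files.

```php
<?php
// Hypothetical sketch; sketch_string_tracker is not a Moodle class.
final class sketch_string_tracker {
    /** @var array strings used on the current page, keyed by component/identifier */
    private static array $used = [];

    /** Called each time a string is resolved. */
    public static function record(string $identifier, string $component): void {
        // Keying by component/identifier lists a repeatedly used string only once.
        self::$used[$component . '/' . $identifier] = [
            'component' => $component,
            'identifier' => $identifier,
        ];
    }

    /** Returns the strings used so far, e.g. to render an edit form before the footer. */
    public static function used(): array {
        return array_values(self::$used);
    }
}

/**
 * Stand-in for get_string(): looks the string up and records the lookup.
 * Nothing is wrapped around the returned text, so it stays safe inside HTML attributes.
 */
function sketch_get_string(array $strings, string $identifier, string $component): string {
    sketch_string_tracker::record($identifier, $component);
    return $strings[$component][$identifier] ?? "[[{$identifier}]]";
}
```

Because the returned text is unchanged, this avoids the attribute problem of the span-based approach; the cost is that the editor can only list the strings used on the page, not point at where each one appears.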

Getting the history of strings into database

Source code management systems can't easily show us how a given string evolved over time. Translators, on the other hand, need to know the "personal history" of any given string, with the author, date and comment of each change. The string timeline will be populated from the DB. For this to work, we need to load the whole history of all strings into the DB. That can be done using our git mirror. This conversion is done:

Regularly for the master English strings that are part of the Moodle source code. David already has a prototype of a script that tracks commits into language files and updates the database.
Once for all other languages. When we have the new translation portal prepared, we will stop supporting direct CVS commits into language packs. The history of all commits into all lang packs will be transferred into the database and further translations will be recorded there.

David is writing these migration scripts as a part of his Languages/AMOS tool.
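One building block of such a migration is comparing the strings of a language file before and after a commit. The sketch below shows that step only and is not taken from the AMOS scripts: the function name is hypothetical, both revisions are assumed to be already loaded as identifier => text arrays, and the commit metadata (author, date, comment) would have to come from the git mirror separately.

```php
<?php
// Hypothetical sketch; this function is not part of AMOS or Moodle.

/**
 * Compares two revisions of a $string array and reports what changed.
 *
 * @param array $old identifier => text before the commit
 * @param array $new identifier => text after the commit
 * @return array lists of identifiers under the keys 'added', 'modified' and 'removed'
 */
function sketch_diff_strings(array $old, array $new): array {
    $changes = ['added' => [], 'modified' => [], 'removed' => []];
    foreach ($new as $id => $text) {
        if (!array_key_exists($id, $old)) {
            $changes['added'][] = $id;
        } else if ($old[$id] !== $text) {
            $changes['modified'][] = $id;
        }
    }
    foreach ($old as $id => $text) {
        if (!array_key_exists($id, $new)) {
            $changes['removed'][] = $id;
        }
    }
    return $changes;
}
```

Each reported identifier could then become one row in the string timeline, and a modified English string is the signal for marking its translations as outdated.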

Design decisions and voting

MDL-21690 Central master database of all strings and their translations
MDL-21635 Decide how to implement information in langconfig.php in AMOS
MDL-21693 Drop _utf8 suffix from language codes and folder names
MDL-21694 Move language files to the plugin space
MDL-21695 Replace built-in HTML help files with proper strings

