Languages subsystem improvements 2.0

Revision as of 08:02, 9 December 2009

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

Languages subsystem improvements
Project state: Research and planning
Tracker issue: MDL-18797, MDL-15252
Discussion: [1]
Assignee: David Mudrak

Moodle 2.0


This is an initial proposal of changes to the language strings processing in Moodle.

Current issues

String files are not branched
We must keep all strings from all branches in place for backwards compatibility, so we cannot easily clean up language packs. Some say that branching and merging is too much to ask of our translators.
Plural forms, gender forms and other grammar
We are unable to handle plurals at all. Handling plural forms the gettext way, for example, is a traditional, well-tested and robust approach (see MDL-4790). MDL-12433 by Sam Marshal shows an alternative approach based on logical expressions.
Strings can't be modified
It is difficult to notify translators that a string was modified (expanded, fixed, changed), as in this case, for example. The current workaround is the policy of adding another string with the same name plus a suffix (like 'license2'). It would be nice if such strings were tagged or highlighted in the translation UI.
We do not use standard formats
Translators can't use specialized tools for translation (PO/gettext editors, community translation portals). Also, I am not aware of any benchmark comparing the performance of our native $string[] format with, for example, the standard .po format.
More syntax checks are required
Needed so that translators do not break Moodle functionality (see MDL-12433).
Language packs are PHP code, but stored in moodledata
This increases the severity of some security exploits: any exploit that lets you write files to an arbitrary location in moodledata suddenly lets you execute arbitrary PHP code on the server. On the other hand, it would be nice to allow complex logic when evaluating dynamic strings (i.e. those containing the $a parameter or parameters).
Right-to-left languages
Problems are reported in RTL languages when using online tools (including our current one), which lead to placeholders like a$ and a$->lastname (instead of $a and $a->lastname) being put into the string definition.

Goals

  1. Do not reinvent the wheel. Follow the "do one thing and do it well" principle. Keep it simple and stupid. Make the translation process as simple as possible (translators are not geeks).
  2. Make simple things easy and hard things possible.

Key design questions

What is the data structure for storing the master copies of the lang packs that translators work on
At the moment it is a plain PHP array, editable via the translation UI or directly. Petr proposes keeping these strings in a database, sort of syncable with some central repo. Whatever the format is, we must be able to store some metadata: the timestamp of the last modification, the author name, proposed alternatives, comments etc. (see the Rosetta translation tool at Launchpad for an example of possible metadata).
What is the UI for translators, what are the processes of contributing and how the translations are redistributed to Moodle sites
Our translators should not be forced to use one single tool. We should consider switching to a standardized common format (like PO or XLIFF) that is supported by a variety of advanced tools (equipped with translation memory, connected with dictionaries, i18n portals etc.).
What is the data structure Moodle uses at runtime
This is just a performance optimization (an implementation detail). It should be independent of the native format that humans work with, so that it can be modified at any time in the future. For example, see the system proposed by Tim based on calling class methods (inspired by Perl's Maketext).
What is the format of a lang string, and how are placeholders substituted
This is the most important issue we have at the moment, but as it is strongly tied to the runtime format, it can be changed at any time. On the other hand, both the UI and the storage format must support it.

Miscellaneous suggestions

  • Store downloaded lang packs in a new location $CFG->langpacks, which defaults to $CFG->datadir/lang. Paranoid admins can change this to a different location that is normally read-only for the web server, but which they will switch to read/write when they are performing an upgrade, doing lang editing, or installing a lang pack. The UI should therefore check whether $CFG->langpacks is writable before starting any of these operations, and explain the situation to admins if it is not.
  • Use a sort of template syntax so translators can replace static strings with a template. The syntax can be similar to what Smarty and other templating engines use. So far we should be fine with a basic set of {if} {else}, eq, gt, lt and some math operators (including modulo). Such templates would be compiled into proper PHP code once during lang pack compilation. Strict rules shall apply, which should reduce the risk of executing malicious code.
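The writability check suggested in the first item could look roughly like the sketch below. It is written in Python purely for illustration (Moodle itself is PHP), and the function name langpacks_status and the message wording are assumptions, not part of the proposal.

```python
import os


def langpacks_status(langpacks_dir):
    """Return (ok, message) telling whether lang packs can be written.

    Hypothetical helper: the UI would call it before an upgrade,
    a lang pack installation or a lang editing session.
    """
    if not os.path.isdir(langpacks_dir):
        return False, f"{langpacks_dir} does not exist or is not a directory"
    if not os.access(langpacks_dir, os.W_OK):
        return False, (
            f"{langpacks_dir} is read-only for the web server; "
            "make it writable before installing or editing language packs"
        )
    return True, f"{langpacks_dir} is writable"
```

The point of returning a message together with the flag is that the UI can explain the situation to the admin instead of failing halfway through an operation.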

Use cases

  1. Developers add new strings to the core
  2. Translators translate untranslated core strings and publish their work
  3. Admins want to locally modify the language pack
  4. Contributors add new strings to the contributed code
  5. Translators translate untranslated contrib strings and publish their work
  6. Admins don't want PHP code stored in moodledata.
  7. ...

Research

This is the list of projects, resources and tools being explored

  • Great CPAN article about software localization. A plain string-based lexicon is not enough; strings can be translated by functions only. "A phrase is a function; a phrasebook is a bunch of functions."
  • XLIFF - XML Localization Interchange File Format
  • Virtaal - promising, we could have XLIFF <-> .php conversion
  • Launchpad - translation portal used by Ubuntu and many other projects. Would require BSD licensing, therefore IMO not suitable as we could not import our current GPL'ed translation. Seems to be pretty slow during the process.
  • Plural forms in gettext
  • Zend_Translate reference guide
  • MDL-12433 - Sam Marshal's proposal
  • MediaWiki approach to grammar forms and plurals ("is"/"are"): an example of how MediaWiki outputs the correct plural form depending on the count. Plural transformations are used for languages like Russian, based on "count mod 10".
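To make the "count mod 10" remark concrete, here is a small sketch of the plural rule that gettext uses for Russian (three forms). It is written in Python for illustration only; the function names and the sample word forms are assumptions chosen for the example and do not come from any of the tools listed above.

```python
def russian_plural_index(n):
    """Index of the plural form for count n (gettext rule for Russian)."""
    if n % 10 == 1 and n % 100 != 11:
        return 0  # 1, 21, 31, ... but not 11
    if 2 <= n % 10 <= 4 and not 10 <= n % 100 < 20:
        return 1  # 2-4, 22-24, ... but not 12-14
    return 2      # everything else: 0, 5-20, 25-30, ...


# Sample forms of the word "file": 1 file, 2 files, 5 files.
FORMS = ["файл", "файла", "файлов"]


def files_label(n):
    """Build a label like '21 файл' using the right plural form."""
    return f"{n} {FORMS[russian_plural_index(n)]}"
```

A plain lexicon with one string per identifier cannot express this; either the string format or a per-language function has to carry the rule.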

Functional proposals

Overall strings processing flow

(Follow the attached UML flow diagram)

UML: Overall string processing flow
  • All string definitions are kept in a central repository in some storage format which supports branching. Officially maintained language packs are referred to as master in this proposal. Every language pack can have its parent defined. The English language pack can be seen as the greatest common parent of all language packs.
  • During upgrade or on demand, the relevant branch of the master language packs is fetched (downloaded) automatically from the central repository. Together with the selected language, all its parents, grandparents, great-grandparents etc. are downloaded, too.
  • Administrators can keep local modifications (customizations) of any master pack. We call them local language packs.
  • Immediately after upgrade (or again, on demand), the string definitions are merged from all available sources. The merge logic evaluates the sources for any given string in an order like: fr_ca_local, fr_ca, fr_local, fr, en_local, en. Strings are merged for performance reasons, so that the search for the string to use (local, parent, master, English etc.) is done just once and we do not need to load all possible sources at runtime. After the merge, we have a single place to look for the string definition for every installed language.
  • Together with the merge, strings are compiled into a runtime format that may be optimised in the future. Humans do not modify the compiled format. Strings must be re-merged and re-compiled after any update of master or local packs. During the compilation, syntax checks are performed.
  • The runtime format we will start with will be very similar to the current one. Strings are defined as array elements indexed by the string identifier. The arrays are defined in separate files for every module. We can, however, modify this in the future. For example, we can divide string definitions into files not by the module name but by the real usage frequency. Strings that are used very often (like on every page) would go into a common file which can be loaded during bootstrap. This would reduce memory usage and the number of I/O operations.
  • The only valid placeholder in runtime format is {$a} for strings and numbers and {$a->foobar} for objects.
  • Around 90% of our strings do not contain any placeholder and they will be immediately returned by get_string().
  • If the string contains one or more placeholders, they are replaced with their eval()-uated result. We can safely eval() the whole string definition because the string compiler makes sure that the placeholders are the only executable/evaluable code. All other potentially malicious code and $variables are properly quoted, escaped or converted to HTML entities.
  • If the string is defined as NULL, the corresponding function defined in the language pack library is called, passing $a as a parameter. So get_string('foo', 'bar', $a); would return the value returned by e.g. lang_cs_bar_foo($a) if the current language is Czech. Power translators may use such functions to properly handle plural forms and other grammar aspects.
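The merge order and the runtime lookup described in the list above can be sketched as follows. This is an illustration in Python, not Moodle code: the helper names, the dictionary-based pack layout and the use of a dict for $a are assumptions. Note also that the sketch substitutes placeholders with a regular expression rather than with eval() as the proposal suggests.

```python
import re


def source_order(lang, parents):
    """Pack names to consult, most specific first (local before master)."""
    order = []
    current = lang
    while current is not None:
        order += [current + "_local", current]
        current = parents.get(current)
    if "en" not in order:  # English is the common ancestor of all packs
        order += ["en_local", "en"]
    return order


def merge_packs(lang, parents, packs):
    """Merge all sources once, so runtime needs a single lookup."""
    merged = {}
    # Apply the least specific source first; later ones override it.
    for name in reversed(source_order(lang, parents)):
        merged.update(packs.get(name, {}))
    return merged


# The only placeholders allowed: {$a} and {$a->field}.
PLACEHOLDER = re.compile(r"\{\$a(?:->(\w+))?\}")


def get_string(identifier, merged, a=None, functions=None):
    """Return the string, filling placeholders from a (scalar or dict)."""
    text = merged[identifier]
    if text is None:
        # NULL definition: delegate to a function of the language pack.
        return functions[identifier](a)
    if "{$a" not in text:
        return text  # the common case: no placeholder at all

    def substitute(match):
        field = match.group(1)
        return str(a if field is None else a[field])

    return PLACEHOLDER.sub(substitute, text)
```

With parents = {"fr_ca": "fr", "fr": "en", "en": None}, source_order("fr_ca", parents) yields the order given in the merge step: fr_ca_local, fr_ca, fr_local, fr, en_local, en.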

Mental model of branching for translators

(Follow the attached scheme)

Mental model of branching

It may not be trivial to understand the principle of branching and merging, as they seem to be quite geeky tech terms. The following model based on layers may be more suitable for translators. Basically, strings are seen as being part of a layer. For every Moodle release, a new layer is put on top of all other layers. At every release, get_string() looks at its own layer for the string definition. If the string is not defined on that layer, the first underlying layer where the string is defined is used. So, for example:

  • stringid01 was defined in Moodle 1.9 (it was part of the Moodle language pack before we did the big Moodle 2.0 cleanup). It remains the same for all following releases.
  • stringid02 was defined in Moodle 1.9. Then it changed in Moodle 2.0 and then again in Moodle 2.3. In Moodle 2.1 and 2.2, the version defined in 2.0 is used.
  • stringid03 was introduced in Moodle 2.0. There is no need to translate it for the previous releases because it does not exist there.
  • stringid04 is similar to stringid02. It appeared sometime/somewhere in Moodle 1.x (until Moodle 2.0, we did not branch), was changed in 2.1 and then again in 2.2
  • stringid05 was dropped in Moodle 2.0 and is not part of the language pack any more. There is no need to translate it for Moodle 2.x
  • stringid06 was introduced in Moodle 2.3
  • stringid07 was dropped in Moodle 2.0 but was then reintroduced in Moodle 2.2 and has been part of the language pack again since then (its definition may even have changed)
  • stringid08 (you should now be able to explain this one yourself ... :-)
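The layer model can be expressed in a few lines. The sketch below is in Python for illustration; the DROPPED marker, the resolve function and the sample string texts are assumptions made up for the example, and only a few of the strings from the list are modelled.

```python
DROPPED = object()  # marker: the string was removed in this release

RELEASES = ["1.9", "2.0", "2.1", "2.2", "2.3"]

# One layer per release; a layer only holds what changed in that release.
LAYERS = {
    "1.9": {"stringid01": "one", "stringid02": "two (1.9)", "stringid05": "five"},
    "2.0": {"stringid02": "two (2.0)", "stringid03": "three", "stringid05": DROPPED},
    "2.1": {},
    "2.2": {},
    "2.3": {"stringid02": "two (2.3)", "stringid06": "six"},
}


def resolve(stringid, release):
    """Definition visible in the given release, or None if there is none."""
    upto = RELEASES.index(release)
    # Start at the release's own layer and look down through older layers.
    for version in reversed(RELEASES[: upto + 1]):
        layer = LAYERS[version]
        if stringid in layer:
            value = layer[stringid]
            return None if value is DROPPED else value
    return None
```

For instance, resolve("stringid02", "2.1") falls through the empty 2.1 layer and returns the 2.0 definition, which matches the stringid02 case above.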

Translation tool as an activity module

  • MDL-18797
  • lives in contrib, translators and administrators wanting to customize the language pack install it
-- Nicolas Martignoni 08:02, 9 December 2009 (UTC): IMHO the translation tool has to live in core, not in contrib, as translators should have total confidence in its functionality (contrib is frequently associated with "unstable" and/or "not maintained"). Moreover, such a tool has to be always up-to-date.
  • various capabilities to propose (submit) alternative translations, to integrate proposals, rate them, export them, send to upstream etc.
  • such activity can be used by individuals, translation teams, can handle community-based models etc.
  • strings are saved in DB with all metadata, history etc.
  • XLIFF used as exchange format with upstream
  • CVS access replaced with a web service at a dedicated lang.moodle.org server

Implementation proposals

Petr's proposal to store strings in one central database and to disable direct commits.

--David Mudrak 22:43, 23 November 2009 (UTC): I disagree with the "no change meaning" rule. IMO if we have a system to track changes and, above all, to inform translators that their translation is outdated, we can fix/update/extend English strings as needed. Together with branching, this will lead to nice "reduced" packs without redundancy. Also note we must find a way to combine this approach with the grammar issues (plural forms etc.) that will probably have to be solved as proper PHP functions/class methods...

File format translators work with

Translation tool and the process

See MDL-15252 (Cleanup of English language pack) and the discussion at http://moodle.org/mod/forum/discuss.php?d=118707 for Koen's proposition. The branching issue, the translation process and other aspects are discussed there.

From Martin in Dev chat: if you want crazy ideas, how about get_string returns some special tags and those tags get converted to ajax on the GUI so that translators can translate directly in the main Moodle GUI?

What a cool idea. Could be a special mode you have to turn on in the admin screens. Perhaps even if you turned this mode on, it would still only be active for people with certain roles, or perhaps when it was turned on, it would have to apply to all roles, so that you could edit strings for not-logged-in users. Anyway, when this mode was on, it would:

  1. Add <span class="moodle-lang-string" id="lang_string|admin|langedit">Language editing</span> around each string on the page (to use one example).
  2. Use $PAGE->requires->js to load an extra JS file that adds an on-click handler to all such spans, so that clicking one pops up the language editing UI in a YUI dialogue.
David Mudrak 14:07, 23 November 2009 (UTC): the solution based on wrapping <span> around every string was already considered and dropped. It may badly break XHTML as the string itself may appear as a value of an HTML tag's attribute: <img title="<span class="moodle-lang-string" .... We are unable to say the scope where the string will appear.
David's counter-proposal: get_string() could track all strings used on the current page, and the AJAX form to edit them all could be rendered before the footer(). Alternatively, an 'Edit system text on this page' link could appear there.
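The tracking idea can be sketched as below: the lookup function records every string it serves, and the page footer later builds an edit form from that record, so no markup is injected into the strings themselves. The sketch is in Python for illustration; the class, its methods and the field naming scheme are assumptions, and the "save" string is an invented sample.

```python
class StringTracker:
    """Serve strings and remember which ones the current page used."""

    def __init__(self, strings):
        self.strings = strings  # {module: {identifier: text}}
        self.used = []          # (identifier, module), in first-use order

    def get_string(self, identifier, module):
        key = (identifier, module)
        if key not in self.used:
            self.used.append(key)
        # The returned text is unchanged, so it is safe inside attributes.
        return self.strings[module][identifier]

    def edit_form_fields(self):
        """Fields for an 'edit system text on this page' form."""
        return [
            {"name": f"{module}|{identifier}",
             "value": self.strings[module][identifier]}
            for identifier, module in self.used
        ]
```

Because the text returned by get_string() is untouched, a string used as the value of a title attribute stays valid, which is the problem the span-wrapping approach could not avoid.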

Runtime file format

See also