Note:

This site is no longer used and is in read-only mode. Instead please go to our new Moodle Developer Resource site.

Database collation issue

From MoodleDocs
Revision as of 07:01, 25 October 2012 by Tim Hunt (talk | contribs)

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

Database collation issue
Project state Specification in progress
Tracker issue TBA
Discussion TBA
Assignee


Background

Database store text data using a particular character set. Moodle uses UTF-8, which is a good thing, since UTF-8 can represent almost any character ever.

Text data is also stored with an associated collation. This affects how the values are sorted and compared. For example, different languages have different sort-orders, which depend on which alphabet they use. For example in German the ß character should come just after S (and s).

Collations typically come in two forms, case sensitive, and case-insensitive. If you are using a case-insensitive collation, then not only do 'x' and 'X' get sorted the same, but actually, the database thinks that 'x' = 'X'.

Operations like JOINs are typically much faster if both tables (columns) use the same collation.

The problems

Collation should depend on current language

On multi-lingual sites like moodle.org, different users will all expect reports to be sorted according to their language's collation rules. If we just specify the collation in database table definition, then all queries for all users have to use the same collation.

Typical bugs:

  • MDL-30446 Glossary order issues for some collation settings (Postgres)

Machine-readable identifiers

Moodle stores two sorts for text data:

  • user-visible strings like user.firstname, user.lastname or course.fullname
  • machine-readable identifiers like user.username, course.idnumber or file.filepath

For machine-readable identifiers, the 'x' == 'X', or worse 'e' == 'é' behaviour is really bad, and has lead to bugs, for example:

  • MDL-29332 Calculated Questions Require Case Sensitive Data Tables (MySQL)
  • MDL-33655 MySQL case insensitivity breaks get_directory_files (MySQL)
  • MDL-34271 improve mysql collation handling

MySQL only has localised collations that are case-insensitive

If you look at the list of MySQL collations, you will see that the only collations that give correctly locale-aware ordering are case-insensitive (the ci bit in utf8_czech_ci), and so suffer from the 'x' == 'X' and 'e' == 'é' problem. The only collation that does not have this problem is utf8_bin, which is not the order people want to see.

Outline solution

Difficulties with the proposed solution