Database collation issue

Warning: This page is no longer in use. The information contained on the page should NOT be seen as relevant or reliable.

Database collation issue
Project state	Specification in progress
Tracker issue	MDL-36208
Discussion	https://moodle.org/mod/forum/discuss.php?d=214564
Assignee	Hero required

Background

Database store text data using a particular character set. Moodle uses UTF-8, which is a good thing, since UTF-8 can represent almost any character ever.

Text data is also stored with an associated collation. This affects how the values are sorted and compared. For example, different languages have different sort-orders, which depend on which alphabet they use. For example in German the ß character should come just after S (and s).

Collations typically come in two forms, case sensitive, and case-insensitive. If you are using a case-insensitive collation, then not only do 'x' and 'X' get sorted the same, but actually, the database thinks that 'x' = 'X'.

Operations like JOINs are typically much faster if both tables (columns) use the same collation.

The problems

Collation should depend on current language

On multi-lingual sites like moodle.org, different users will all expect reports to be sorted according to their language's collation rules. If we just specify the collation in database table definition, then all queries for all users have to use the same collation.

Typical bugs:

MDL-30446 Glossary order issues for some collation settings (Postgres)

Machine-readable identifiers

Moodle stores two sorts for text data:

user-visible strings like user.firstname, user.lastname or course.fullname
machine-readable identifiers like user.username, course.idnumber or file.filepath

For machine-readable identifiers, the 'x' == 'X', or worse 'e' == 'é' behaviour is really bad, particularly if you want to put a unique index on the column in question. It has lead to bugs like:

MDL-29332 Calculated Questions Require Case Sensitive Data Tables (MySQL)
MDL-33655 MySQL case insensitivity breaks get_directory_files (MySQL)
MDL-34271 improve mysql collation handling

Human-readable case-sensitive strings

MDL-22117 Units in numerical questions are acting case sensitive, but not functioning that way

MySQL only has localised collations that are case-insensitive

If you look at the list of MySQL collations, you will see that the only collations that give correctly locale-aware ordering are case-insensitive (the ci bit in utf8_czech_ci), and so suffer from the 'x' == 'X' and 'e' == 'é' problem. The only collation that does not have this problem is utf8_bin, which is not the order people want to see.

Outline solution

The proposed solution is simple in outline:

Store all data in database tables that use a case-sentistive collation (e.g. utf8_bin).
Change all ORDER BY clauses in all SQL that sort text, to specify the correct sort collation, based on the current language.

To support that second part, we will need some helper code, for example, each language pack must specify the right collation to use, for each supported database, for example using new strings in langconfig.php. We will also need an $DB->sql_order_by_text($column) helper function to return the right SQL fragment like "columnname COLLATE utf8_czech_ci".

Difficulties with the proposed solution

Transitional period (minor)

In the transitional period, when the table definitions have been updated to use a collation like utf8_bin, but before we have found and fixed all SQL queries in Moodle core (and in all third-party plugins) to use sql_order_by_text, then the worst that can happen is that the results of some reports appear in a slightly odd order for some users. The reports will still basically work.

Whenever a problem like this is detected, it will be easy to add the necessary sql_order_by_text call to the code to fix it.

Time to do the upgrade (moderate)

The amount of time required to upgrade all tables to a new collation is going to be comparable to the time required for the change that changed all unsigned int columns to signed. This was an issue for some large sites.

Like dropping unsigned columns, this is probably a necessary evil that we can live with.

Backwards compatibility of identifiers (severe)

This is the most serious problem. It affects areas like:

What are valid user names? Are user names case-sensitive.
For people using integrations like the database authentication and enrolment plugins, or CVS import, it may be that at the moment, the identifiers currently match using case-insensitive collation, but if we changed the database collation to utf8_bin, then things would not longer match. This could cause a real mess.

So, we may have to add an admin option for these areas, where admins can choose if they want case-sensitive or case-insenstivive collations. Then, if they choose case-insensitive collation for backwards compatibility, we will have to add explicit LOWER(...) or COLLATE calls to the code, to maintain the current behaviour.

Here the risk is not just wrong sorting of reports, but it could badly break sites data integration. Hence, the risks if we get this wrong is very severe.

We do not yet have a good solution to this problem.

Documentation