Note:

This site is no longer used and is in read-only mode. Instead please go to our new Moodle Developer Resource site.

Database collation issue: Difference between revisions

From MoodleDocs
Created page with "{{Work in progress}} {{Infobox Project |name = Database collation issue |state = Specification in progress |tracker = TBA |discussion = TBA |assignee = }} == The problem == ..."
 
No edit summary
Line 8: Line 8:
}}
}}


== The problem ==
== Background ==


Database store text data using a particular '''character set'''. Moodle uses UTF-8, which is a good thing, since UTF-8 can represent almost any character ever.


Text data is also stored with an associated '''collation'''. This affects how the values are sorted and compared. For example, different languages have different sort-orders, which depend on which alphabet they use. For example in German the '''ß''' character should come just after '''S''' (and '''s''').


=== Related bugs ===
Collations typically come in two forms, case sensitive, and case-insensitive. If you are using a case-insensitive collation, then not only do 'x' and 'X' get sorted the same, but actually, the database thinks that 'x' = 'X'.


Operations like JOINs are typically much faster if both tables (columns) use the same collation.
== The problems ==
=== Collation should depend on current language ===
On multi-lingual sites like [https://moodle.org/ moodle.org], different users will all expect reports to be sorted according to their language's collation rules. If we just specify the collation in database table definition, then all queries for all users have to use the same collation.
Typical bugs:
* MDL-30446 Glossary order issues for some collation settings (Postgres)
=== Machine-readable identifiers ===
Moodle stores two sorts for text data:
* user-visible strings like user.firstname, user.lastname or course.fullname
* machine-readable identifiers like user.username, course.idnumber or file.filepath
For machine-readable identifiers, the 'x' == 'X', or worse 'e' == 'é' behaviour is really bad, and has lead to bugs, for example:
* MDL-29332 Calculated Questions Require Case Sensitive Data Tables (MySQL)
* MDL-29332 Calculated Questions Require Case Sensitive Data Tables (MySQL)
* MDL-33655 MySQL case insensitivity breaks get_directory_files (MySQL)
* MDL-33655 MySQL case insensitivity breaks get_directory_files (MySQL)
* MDL-30446 Glossary order issues for some collation settings (Postgres)
* MDL-34271 improve mysql collation handling
* MDL-34271 improve mysql collation handling


=== MySQL only has localised collations that are case-insensitive ===
If you look at the list of [https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-sets.html MySQL collations], you will see that the only collations that give correctly locale-aware ordering are case-insensitive (the ci bit in utf8_czech_ci), and so suffer from the 'x' == 'X' and 'e' == 'é' problem. The only collation that does not have this problem is utf8_bin, which is not the order people want to see.


== Outline solution ==
== Outline solution ==




=== Difficulties ===
 
== Difficulties with the proposed solution ==
 
 
 
[[Category:Project]]

Revision as of 07:01, 25 October 2012

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

Database collation issue
Project state Specification in progress
Tracker issue TBA
Discussion TBA
Assignee


Background

Database store text data using a particular character set. Moodle uses UTF-8, which is a good thing, since UTF-8 can represent almost any character ever.

Text data is also stored with an associated collation. This affects how the values are sorted and compared. For example, different languages have different sort-orders, which depend on which alphabet they use. For example in German the ß character should come just after S (and s).

Collations typically come in two forms, case sensitive, and case-insensitive. If you are using a case-insensitive collation, then not only do 'x' and 'X' get sorted the same, but actually, the database thinks that 'x' = 'X'.

Operations like JOINs are typically much faster if both tables (columns) use the same collation.

The problems

Collation should depend on current language

On multi-lingual sites like moodle.org, different users will all expect reports to be sorted according to their language's collation rules. If we just specify the collation in database table definition, then all queries for all users have to use the same collation.

Typical bugs:

  • MDL-30446 Glossary order issues for some collation settings (Postgres)

Machine-readable identifiers

Moodle stores two sorts for text data:

  • user-visible strings like user.firstname, user.lastname or course.fullname
  • machine-readable identifiers like user.username, course.idnumber or file.filepath

For machine-readable identifiers, the 'x' == 'X', or worse 'e' == 'é' behaviour is really bad, and has lead to bugs, for example:

  • MDL-29332 Calculated Questions Require Case Sensitive Data Tables (MySQL)
  • MDL-33655 MySQL case insensitivity breaks get_directory_files (MySQL)
  • MDL-34271 improve mysql collation handling

MySQL only has localised collations that are case-insensitive

If you look at the list of MySQL collations, you will see that the only collations that give correctly locale-aware ordering are case-insensitive (the ci bit in utf8_czech_ci), and so suffer from the 'x' == 'X' and 'e' == 'é' problem. The only collation that does not have this problem is utf8_bin, which is not the order people want to see.

Outline solution

Difficulties with the proposed solution