Development:XMLDB problems

From MoodleDocs

XML database schema > XMLDB Problems


Every section on this page will show important problems present in the adoption of the new database layer planned for 1.7. Each section will define the problem, its impact and, when available, its solution. Every section will include one line containing its status ('Open', 'Decided, 'Work in progress', 'Finished', 'Tested' and 'Closed').

Of course, fell free to use the Bug Tracker to follow the daily evolution of each one of these problems by adding one link inside each section pointing to the proper bug.

Prior to change the status from 'Decided' to 'Work in progress' the corresponding page of the roadmap must be modified, including all the actions/steps needed to solve the problem. Obviously, at the end, the expected status for each section is 'Closed'.

Regular expressions

(Status: Open - Severity: Critical - Bug: not defined)

Problems with Oracle (not available until 10g) and MSSQL (not available). Oracle 10g implements them using directly and one package existed since ages (Oracle 8i?) to handle them (owa_pattern). MSSQL can execute them by installing some stored procedures. Oh, oh, problems with some (a few) queries...

SQL Limit performance

(Status: Open - Severity: Not critical - Bug: not defined)

Analyse the impact of such SelectLimit() calls under SQL*Server, Oracle... because it's emulated by ADOdb on those DBs, because their lack of support for the LIMIT clause. As the offset parameter grows, ADOdb must iterate over more records to get the desired window and it could be a problem under long sets of records!

Naming conventions

(Status: Open - Severity: Critical - Bug: not defined)

What to do with all the previously created DB objects (indexes, unique indexes, sequences...) if their naming schema doesn't fit with the implemented by ADOdb. Drop/recreate everything? Leave it unmodified? Can we use different naming schemes for each DB. PostgreSQL one seems more formal, is it the way for Oracle and SQL*Server. Also I would propose to add some option to the Health Center in order to be able to drop all the indexes and regenerate them following the exact naming rules specified in the XMLDB naming page.

Naming conventions II

(Status: Closed - Severity: Critical - Bug: 6230)

Oracle imposes one limit of 30cc. for naming DB objects (table, field, index...), and currently we have some objects over such limit: mdl_glossary_entries_categories and mdl_question_dataset_definitions. They must be renamed to anything else ASAP. No alternate solution exists. Also, once renamed we should restrict the maximum length of $CFG->prefix to avoid undesired situations, at least for new installations to, say, 5cc, limiting table names to 25cc (without the prefix). PostgreSQL before 7.4 limit seems quite near too (32 cc.).

Solution: Limit table/field/index/key/sequence names to 30cc and deny the use of prefixes for Oracle installations. See key and index naming to get info about how XMLDB constructors will be naming all the remaining objects in DB.

NOT NULL fields using a DEFAULT '' clause

(Status: Open - Severity: Critical - Bug: 6218 )

Under Moodle 1.6 there are tons of char/varchar/text/blob columns that are defined as NOT NULL and contains one DEFAULT clause with the value '' (empty string).

There are two important problems with this type of definitions:

  1. They haven't to much sense from a design perspective. If one field is defined as NOT NULL it shouldn't contain empty values (yep, agree that NULL != everything, included empty, but in a logical world I cannot find any reason to fill fields with empty values).
  2. Oracle considers empty strings as NULL values the DEFAULT application will crash in that RDBMS. (http://www.techonthenet.com/oracle/questions/empty_null.php)

The proper way to solve this situation should be to annihilate all those inconsistent/wrong combinations from the whole DB schema (NOT NULL + DEFAULT ''), but it can be really hard and risky to do it now (Moodle 1.6-1.7). So here it's the proposed solution:

  1. Don't modify 1.6 at all. Everything should continue working properly.
  2. In the XMLDB files, for 1.7, use the correct approach, i.e. all those fields won't have a default value in the XML structure, although they'll continue as NOT NULL there.
  3. In the XMLDB generator (create table), for MySQL, PostgreSQL and MSSQL, add the DEFAULT clause for all the char/text columns that haven't a DEFAULT defined in the XML structure.
  4. Add some check in the installation script to look for that MySQL modes being problematic.

With this approach we'll have MySQL and PostgreSQL servers running exactly without modification in 1.7, with MSSQL working in the same manner, and Oracle version would be more correct, from the DB perspective, although sure it won't work in a lot of places. Also, if the Oracle isn't usable at all, we can trick it a bit, both declaring all those fields directly as NULL (+1 for it) or adding one dummy default value like one whitespace. They aren't perfect solutions but should improve a bit the Oracle experience, although, under 1.7, it's only going to be ALPHA-BETA-quality (too much changes are required across all the code to solve the NULL/NOT NULL/DEFAULT problem, and time for 1.7 is really short).

After 1.7, we should start to modify MySQL, PostgreSQL and MSSQL implementation to use the same (correct) field definition (nullable columns) with more time to change/test/fix everything, progressively moving NOT NULL stuff to NULL if it's really nullable, dropping all those empty defaults.

Indexes on TEXT columns

(Status: Open - Severity: Not Critical - Bug: not defined)

Each RDBMS implementation has completely different mechanisms in order to create FULLTEXT indexes against TEXT columns. Also, the SQL commands to perform searches against such columns are pretty different (incompatible). This forces us to avoid TEXT column indexing for now. Here there are the list of fields we have changed in the transition to the XMLDB schema to avoid such indexing:

Also, it's important to differentiate between FULLTEXT indexes (where we are going to index and search freely any text content) and FUNCTION indexes where we just index the results of the execution of one function. This last indexes aren't a suitable replacement for TEXT indexes, and aren't used at all if the EXACT" function call used to define the index isn't used by the SQL query.

In the long term, the XMLDB schema should be extended to support this type of indexes to perform better fulltext searches. At the time of writing this, some efforts are being performed by the Global search project, part of the Student projects.

Finally, note that a lot of RDBMS flavours aren't able to keep all those FULLTEXT indexes updated online all they need to be synchronised/optimised continuously by "external" agents (stored procedures, command line executables, cron tasks...).

Some interesting links about FULLTEXT indexing/search/maintain:

MSSQL, PHP, UTF-8 and UCS-2

(Status: Finished - Severity: Critical - Bug: not defined)

Practically all the modern RDBMS support more than one encoding to be used, both to store information in such encodings or to handle data between clients and servers with automatic conversions. More properly, all them support natively the UTF-8 encoding that is currently used by practically all the web applications, mobile widgets and all sort of devices.

But there is one notable exception, SQL*Server. It exclusively supports the UCS-2 encoding (another Unicode flavour) to be used against their special nchar columns. Worse is the fact that Microsoft doesn't offer any reliable solution for 3rd part clients inside their plethora of connectivity solutions (ODBC, OLE, COM...) to transparently perform the conversion from/to UCS-2 and the desired encoding (UTF-8 for Moodle).

You can see the solutions suggested by Microsoft in: http://support.microsoft.com/kb/232580/EN-US/ :-P

Solutions provided in that article, although could help to some people, aren't ok for Moodle. We need to use PHP and be able to send and receive UTF-8 data from/to the server, storing it properly in the server side (in UCS-2) to be able to handle it properly.

After a lot of time googling the net it seems that the basis of the problem is the called DBLIB library (the dll file has one unpronounceable name). This library is used by practically all the DB clients to build their interfaces. PHP and others rely on this library to build their own modules of access to SQL*Server databases. And this base library lacks UTF-8 support (or at least it doesn't offer conversion from/to UTF-8).

With this on mind, a research of alternative solutions began. Any solution should fulfil these characteristics:

  • Convert transparently from UTF-8 (PHP and Moodle) to UCS-2 (SQL*Server) so everything was properly stored on each side.
  • Obviously, be supported by PHP 4.3.x upwards (min req. for Moodle).
  • Be supported by ADOdb, our Database Abstraction Library for PHP.
  • Be available both for Unix and Windows environments.
  • If possible, be able to replace the standard php_mssql extension that doesn't support the automatic conversion. This would be great because mssql support from ADOdb is pretty good (better that other alternatives like COM, OLE...).

And finally, they exist! We've found two products (OpenSource, of course) that fulfil all the requirements detailed above. Both perform the automatic conversion between UTF-8 and UCS-2, replace the standard php_mssql library so we'll continue using the well supported mssql ADOdb driver and are cross-platform. Anything more, nah, here they are:

  • FreeDTS: This a 100% client product that it's based in the [:wiki:Tabular_Data_Stream Tabular Data Protocol] supported both SQL*Server and Sybase. We have compiled and used it both under Linux and MacOS X without problems and with one minimal configuration it replaces the standard php_mssql extension smoothly. The documentation says it works also under Win32 systems but we haven't tried it there. Tons of reports in Internet talk about it as a reliable solution.
  • ODBTP: This is another protocol, Open Database Transport Protocol, that allows to connect virtually to any Win32 DB server. It requires both one client installation and one service to be installed in one Win32 machine (to act as bridge between the client and the DB, via ODBC). The client part is available for Linux and MacOS X (not tested) and also implements one replacement for the standard php_mssql extension. The service part is obviously, available only for Win32. After a minimal configuration it works perfectly and it's announced as reliable for critical missions.

(perhaps some HOWTO about installation could help here, plus some direct download links, although it isn't really difficult to configure/compile/install them)

Reserved Words in Moodle DB

(Status: Work in Progress - Severity: Critical - Bug: [https://tracker.moodle.org/browse/MDL-6307 MDL-6307])

Some of the DB names used in Moodle 1.6 are reserverd words (words that are forbidden to be used as identifiers for DB objects). The list of currently detected Reserved Words is:

  • assignment_submissions->comment (oracle)
  • forum->open (mssql)
  • glossary_comments->comment (oracle)
  • journal_entries->comment (oracle)
  • question_dataset_items->number (oracle)
  • question_sessions->comment (oracle) Done! (question_sessions->manualcomment)
  • resource (oci8po)
  • timezone->rule (mssql) Done! (timezone->tzrule)
  • user (mssql, oracle, postgres)

(the format of the list is table->field (reserved on DB)

Note that reserved words aren't dangerous if we are using prefixes for tables (so we should force prefix usage under Moodle 1.7 for NEW installations, 1-2cc for oracle and 1-10cc for the rest).

ADOdb fetch mode (ASSOC, NUM and BOTH)

(Status: Work in Progress - Severity: Critical - Bug: not defined)

By default we use ADOdb under FETCH_BOTH mode. This returns mixed recordsets with both numeric keys (0, 1, 2...) and literal keys (the name of the fields). And then, we get an associative array from such recordset.(for all the get_records_XXX functions). Let's see one example:

If we have one table, call it "people" with three columns (id, name, address) and with this data:

   666, 'Bill', 'Medina
   777, 'Steve', 'Palo Alto'

So, one simple query like this:

 SELECT id, name, address
 FROM people;

Will return one recordset with this structure in PHP:

 array (
     [0] => stdClass Object (
         [0] = 666
         [id] = 666
         [1] = Bill
         [name] = Bill
         [2] = Medina
         [address] = Medina}
     [1] => stdClass Object (
         [0] = 777
         [id] = 777
         [1] = Steve
         [name] = Steve
         [2] = Palo Alto
         [address] = Palo Alto)
 )

And, when we get the associative array it will be transformed to:

 array (
     [666] => stdClass Object (
         [id] = 666
         [1] = Bill
         [name] = Bill
         [2] = Medina
         [address] = Medina}
     [777] => stdClass Object (
         [id] = 777
         [1] = Steve
         [name] = Steve
         [2] = Palo Alto
         [address] = Palo Alto)
 )

i.e. the first attribute of the original recordset is used to build the key of the final array, disappearing from the array itself. In our example above, the first column in the recordset was the index [0].Their values (666, 777) have been used as key in the associative array and it has disappeared from it.

We have been using this mechanism to work with associative arrays inside Moodle, where both the key and the [id] element were pointing to the same value, being possible to handle both fields. But this trick has two major drawbacks:

  1. All the data is transmitted and stored twice. As you see in the original recordset, the data of all the fields is duplicated, while we aren't using the [1], [2]... indexes at all.
  2. Although the order until now, both for MySQL and PostgreSQL always retrieve the numeric indexes first, and then the named indexes, some drivers invert this order, returning the named indexes first and then the numerical ones. And this have highly negative consequences in the the final structures that are returned by the get_records_XXX() functions, losing completely the [id] key, that is used across all Moodle.

So, this solution is proposed:

  • In order to solve (2) ASAP, we'll duplicate the first field of each record in the recordset before calling to GetAssoc(). With this we guarantee that nothing is lost and the field becoming key in the associative array will continue in the record.
  • After 1.7, we should change our approach from FETCH_BOTH to FETCH_ASSOC to solve (1), retrieving/handling half the data are currently used. All the places using the numerical keys will be updated (although they should be practically non-existent).

See also