Difference between revisions of "Backup 2.0 - Improve XML parsing"

Revision as of 23:13, 6 March 2009

Backup 2.0 -> Improve XML parsing

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

Moodle 2.0


Original: https://docs.moodle.org/en/Development:Backup 2.0 - Improve XML parsing

Summary

The current (Moodle 1.x) restore uses too much memory when parsing some "parts" of the XML information. We need to replace the current approach with one that provides optimal memory usage and acceptable throughput. This page describes the different alternatives researched, with their strengths and weaknesses. Finally, one solution will be proposed for implementation in the new Moodle 2.0 backup & restore. If possible, some changes will also be made in Moodle 1.9 in order to achieve better results and fix bugs like MDL-14302, MDL-15489, MDL-9838...

Research

Below you will find information about the different methods used to parse one 12.5MB XML file, corresponding to one real quiz module with only 788 attempts and 20115 states (answers). This file has proved to be the problematic "part" on one production server running the current Moodle 1.9.x (the file isn't available here for privacy reasons, obviously).

Each method, in order to be considered valid, must fulfil these basic objectives:

  • Parse the XML file.
  • Provide one in-memory object with all the needed info in order to be processed later by the corresponding restore plugin. Discussion of this point moved to the talk page.

For each method, some common information bits are provided (to allow comparisons later).

  • name: mnemonic to easily reference each method later in comments.
  • file size: the size of the original XML file being parsed.
  • memory: maximum memory used by PHP during the execution (provided by memory_get_peak_usage()).
  • time: time required to perform the parsing to memory (measured in seconds).
  • data size: size of the in-memory generated object (final result of the execution).
  • data format: specifications of the in-memory generated object (xmlize compatible, custom...).
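
The figures listed above could be collected with a small harness along these lines. This is only a sketch: the function name measure_parsing_method() and the array keys are assumptions made for illustration, not the code used to produce the results on this page, and the data size is just a rough estimate taken from the change in memory_get_usage().

```php
<?php
// Hypothetical helper (not from the original page): runs one parsing
// callback against a file and collects the common metrics listed above.
function measure_parsing_method($name, $file, $parser) {
    $before = memory_get_usage();
    $start = microtime(true);

    // The callback receives the file path and returns the in-memory object.
    $data = $parser($file);

    $elapsed = microtime(true) - $start;
    // Rough estimate only: memory still held after parsing finished.
    $datasize = memory_get_usage() - $before;

    return array(
        'name'      => $name,
        'file_size' => filesize($file),          // bytes
        'memory'    => memory_get_peak_usage(),  // bytes, peak for the whole script
        'time'      => $elapsed,                 // seconds
        'data_size' => $datasize,                // bytes, approximate
        'data'      => $data,
    );
}
```

Because memory_get_peak_usage() reports the peak for the whole script, each method would need to run in its own PHP process for the memory figures to be comparable.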

Table of Results

This table summarises the raw results obtained running each method, providing links to details about each of them.

Method                                                              | File size | Memory  | Time         | Data size | Format
Method 0: Current behaviour (xmlize)                                | 12.5MB    | 311.3MB | 4.8 seconds  | 14MB      | xmlize
Method 1: SimpleXML parsing + conversion to xmlize format           | 12.5MB    | 165.8MB | 5.2 seconds  | 14MB      | xmlize
Method 2: SimpleXML parsing, no conversion                          | 12.5MB    | 36.5MB  | 0.5 seconds  | 8.5MB     | simplexml
Method 3: Custom SAX parser + conversion to xmlize format           | 12.5MB    | 158.3MB | 47.5 seconds | 14MB      | xmlize
Method 4: Custom SAX parser + conversion to xmlize-reduced format   | 12.5MB    | 72MB    | 51.9 seconds | 7.7MB     | xmlize-reduced
Method 5: Custom SAX parser + conversion to custom (simple) format  | 12.5MB    | 64.5MB  | 15.1 seconds | 5.4MB     | custom-simple

  • Temp Note: To avoid forgetting to talk about:
    • Improve4memory and improve4speed
    • SimpleXML in core (possibility to provide transformation to xmlize for easier contrib code upgrade of backup/restore).
    • Split moodle.xml into multiple files (one for each current "TODO" - part in Moodle 1.9 + 1 per module). Some of this has been worked out/tested in MDL-18468, showing important speed benefits.
    • With that split performed it will be easier to support different "envelopes" for the same modules (section backup/course backup/1 activity backup...).
    • Analyse the possibility of defining the whole backup & restore process in a declarative way. One XML to specify the whole thing would be amazing... explore the possibilities. Different from install.xml for sure (just in case somebody is thinking about that) (see talk page).
    • Allow modules to request "sub-parts" of the whole module.xml file. While we can accept each module as a whole (and load it straight to memory), some modules (quiz) are big enough for this approach to cause problems (see MDL-15489). Modules should be able to request "different sub-parts" using some simple mini-API (quiz example: "give me the quiz module data", or "iterate over the attempts in a nice loop"...) (see talk page).
    • Modules implementing own parser?
    • Backport or analyse some interim solution for 1.9 quizzes.
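
The "sub-parts" idea in the list above could look roughly like the following. It is only a sketch under stated assumptions: it uses PHP's XMLReader, which is not one of the methods measured on this page, and the function name iterate_elements() and the ATTEMPT and ID tag names are hypothetical.

```php
<?php
// Hypothetical mini-API (an assumption, not an existing Moodle function):
// yields every element with the given tag name as a small SimpleXMLElement,
// so only one such element needs to be held in memory at a time.
function iterate_elements($file, $tagname) {
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }
    // Walk the document node by node; nothing is accumulated here.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagname) {
            // Only the current element (with its children) is materialised.
            yield new SimpleXMLElement($reader->readOuterXml());
        }
    }
    $reader->close();
}
```

A module could then write something like foreach (iterate_elements('moodle.xml', 'ATTEMPT') as $attempt) { ... } and read (string) $attempt->ID inside the loop. This simple version assumes the requested elements are never nested inside each other.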

The alternatives analysed

Here each of the alternatives is analysed and compared.

Method 0: Current behaviour (xmlize)

Summary: file size: 12.5MB, memory: 311.3MB, time: 4.8 seconds, data size: 14MB, data format: xmlize format

Method 1: SimpleXML parsing + conversion to xmlize format

Summary: file size: 12.5MB, memory: 165.8MB, time: 5.2 seconds, data size: 14MB, data format: xmlize format
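
To illustrate what a SimpleXML to xmlize conversion involves, here is a minimal sketch. It assumes the xmlize-style nesting where element content lives under '#', attributes under '@', and child elements are indexed lists. The function name simplexml_to_xmlize() is hypothetical, this is not the conversion code that was measured, and mixed content (text next to child elements) is ignored.

```php
<?php
// Hypothetical converter (illustration only): turns a SimpleXMLElement into
// a nested array using xmlize-style keys ('#' = content, '@' = attributes).
function simplexml_to_xmlize(SimpleXMLElement $node) {
    $result = array();

    // Attributes, if any, go under '@'.
    $attributes = array();
    foreach ($node->attributes() as $attrname => $attrvalue) {
        $attributes[$attrname] = (string) $attrvalue;
    }
    if ($attributes) {
        $result['@'] = $attributes;
    }

    $children = $node->children();
    if (count($children) === 0) {
        // Leaf element: '#' holds the text content.
        $result['#'] = (string) $node;
    } else {
        // Branch element: '#' holds child tag name => list of converted children.
        $result['#'] = array();
        foreach ($children as $childname => $child) {
            $result['#'][$childname][] = simplexml_to_xmlize($child);
        }
    }
    return $result;
}
```

The root element would be wrapped by hand, for example $data = array($xml->getName() => simplexml_to_xmlize($xml)). Both the SimpleXML tree and the converted array exist in memory at the same time during the conversion.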

Method 2: SimpleXML parsing, no conversion (use simplexml as final object)

Summary: file size: 12.5MB, memory: 36.5MB, time: 0.5 seconds, data size: 8.5MB, data format: simplexml format
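
A minimal sketch of what using SimpleXML directly might look like is shown here. The function name load_with_simplexml() and the ATTEMPTS, ATTEMPT and ID element names are assumptions for illustration; this is not the code that was measured.

```php
<?php
// Hypothetical wrapper (illustration only): loads the whole file with
// SimpleXML and returns the object itself as the final in-memory result.
function load_with_simplexml($file) {
    $xml = simplexml_load_file($file);
    if ($xml === false) {
        throw new RuntimeException("Cannot parse $file");
    }
    return $xml;
}
```

The restore code would then walk the object directly, for example foreach ($xml->ATTEMPTS->ATTEMPT as $attempt) { $id = (string) $attempt->ID; }, with no intermediate array being built.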

Method 3: Custom SAX parser + conversion to xmlize format

Summary: file size: 12.5MB, memory: 158.3MB, time: 47.5 seconds, data size: 14MB, data format: xmlize format

Method 4: Custom SAX parser + conversion to xmlize-reduced format

Summary: file size: 12.5MB, memory: 72MB, time: 51.9 seconds, data size: 7.7MB, data format: xmlize-reduced format

Method 5: Custom SAX parser + conversion to custom (simple) format

Summary: file size: 12.5MB, memory: 64.5MB, time: 15.1 seconds, data size: 5.4MB, data format: custom simple format
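
As an illustration of the SAX approach, the sketch below reads the file in chunks with PHP's xml_parser_* functions and builds a plain nested array. The target layout (tag name => list of values, with leaf elements stored as strings) is an assumption chosen for the example and not necessarily the custom-simple format measured here. The function name sax_parse_to_simple_array() is hypothetical.

```php
<?php
// Hypothetical SAX-based parser (illustration only). Attributes are ignored.
function sax_parse_to_simple_array($file) {
    $stack = array(array()); // one frame of collected children per open element
    $text = array('');       // accumulated character data per open element

    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$stack, &$text) {
            // Opening tag: start a fresh frame for this element.
            $stack[] = array();
            $text[] = '';
        },
        function ($p, $name) use (&$stack, &$text) {
            // Closing tag: fold the finished element into its parent frame.
            $children = array_pop($stack);
            $chars = array_pop($text);
            $value = $children ? $children : trim($chars);
            $stack[count($stack) - 1][$name][] = $value;
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text[count($text) - 1] .= $data;
    });

    // Feed the parser in small chunks so the raw file is never fully in memory.
    $fh = fopen($file, 'rb');
    if ($fh === false) {
        throw new RuntimeException("Cannot open $file");
    }
    while (!feof($fh)) {
        $chunk = fread($fh, 8192);
        if (!xml_parse($parser, $chunk, feof($fh))) {
            fclose($fh);
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
    }
    fclose($fh);

    return $stack[0];
}
```

With this layout, the second attempt id of a quiz would be read as $data['MOD'][0]['ATTEMPTS'][0]['ATTEMPT'][1]['ID'][0], assuming those tag names.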

Formats

In this section there are some explanations about the different object formats generated by the different methods:

XMLize format

SimpleXML format

XMLize-reduced format

Custom-simple format

Code

Here is the code used for the different parsing methods described above:

Method 0: Current behaviour

// Read the whole backup file into a string, then let xmlize() build the full array from it.
$contents = file_get_contents('moodle.xml');
$data = xmlize($contents);