Development:Backup 2.0 - Improve XML parsing

Revision as of 14:22, 2 March 2009


Note: This article is a work in progress. Please use the page comments or an appropriate moodle.org forum for any recommendations/suggestions for improvement.


Original: https://docs.moodle.org/en/Development:Backup 2.0 - Improve XML parsing

Summary

The current (Moodle 1.x) restore uses too much memory when parsing some "parts" of the XML information. We need to replace the current approach with one that provides optimal memory usage and acceptable throughput. This page describes the alternatives researched, with their strengths and weaknesses. Finally, one solution will be chosen for implementation in the new Moodle 2.0 backup & restore. Also, if possible, some changes will be made in Moodle 1.9 to achieve better results and to solve bugs such as MDL-14302, MDL-15489, MDL-9838...

Research

Below you will find information about the different methods used to parse one 12.5MB XML file, corresponding to one real quiz module with only 788 attempts and 20115 states (answers). It has proved to be the problematic "part" on one production server running the current Moodle 1.9.x (the file isn't available here for privacy reasons, obviously).

Each method, in order to be considered valid, must fulfil these basic objectives:

Parse the XML file.
Provide one in-memory object with all the needed info in order to be processed later by the corresponding restore plugin.

Eloy, is that second one really necessary? Can't we find a way to process the data as we parse it? Mind you, a quick back-of-the-envelope calculation shows that 12.5MB of XML is extreme for one quiz, so if Method 2 is that good, perhaps we can be lazy and load each activity into memory one at a time.--Tim Hunt 19:05, 1 March 2009 (CST)

Hi Tim, I'm analysing more things than simply the memory/speed considerations of the current parsing. I started with that because of current bugs preventing people from restoring 1.9 courses, and I wanted to explore that ASAP. For 1.9 there is no chance to change the architecture, but perhaps we could use Method 2 or Method 5 selectively when restoring quizzes. About 2.0, the more I think about it, the more I'm inclined to split the current monolithic moodle.xml into smaller pieces. That will cause another immediate memory reduction and speedup (it's an order of magnitude more efficient to process 20*1MB files than 1*20MB file). Anyway, I'm not sure if we can make the whole process of parsing and restoring a pure-SAX process because, for example, the attempt must be created BEFORE the states and, being formal, if we follow the SAX approach, the attempt tag hasn't been closed when its states arrive, hence we haven't created it yet. So, perhaps we'll need to continue loading 1 module into memory. Luckily, Method 2 looks really nice (lightspeed, compressed enough and cheap in memory usage). Let's see how the thing evolves, thanks! --Eloy Lafuente (stronk7) 08:22, 2 March 2009 (CST)

For each method, some common information bits are provided (to allow comparisons later).

name: mnemonic to easily reference each method later in comments.
file size: the size of the original XML file being parsed.
memory: max memory used by PHP in the execution (provided by memory_get_peak_usage()).
time: time required to perform the parsing to memory (measured in seconds).
data size: size of the in-memory generated object (final result of the execution).
data format: specifications of the in-memory generated object (xmlize compatible, custom...).
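As a rough illustration of how these figures could be collected, here is a minimal sketch of a measurement harness. The helper name measure_parse() and the made-up input are assumptions for illustration, not the script actually used for this page. Only memory_get_peak_usage() is named in the text above; timing with microtime() and estimating the data size from a memory_get_usage() delta are assumptions about one possible way to measure.

```php
<?php
// Hypothetical helper (not part of Moodle): runs one parsing callback and
// collects the metrics described in the list above.
function measure_parse($parser, $xml) {
    $memorybefore = memory_get_usage();
    $start = microtime(true);
    $data = call_user_func($parser, $xml);
    $elapsed = microtime(true) - $start;
    return array(
        'file_size' => strlen($xml),            // size of the XML input, in bytes
        'memory'    => memory_get_peak_usage(), // peak memory seen by PHP, in bytes
        'time'      => $elapsed,                // parsing time, in seconds
        // Rough proxy only (assumption): memory allocated inside extensions
        // may not be fully counted by memory_get_usage().
        'data_size' => memory_get_usage() - $memorybefore,
        'data'      => $data,
    );
}

// Tiny made-up input; the real 12.5MB file is not available.
$xml = '<ATTEMPTS>' . str_repeat('<ATTEMPT><ID>1</ID></ATTEMPT>', 1000) . '</ATTEMPTS>';
$stats = measure_parse('simplexml_load_string', $xml);
printf("file: %d bytes, peak memory: %d bytes, time: %.4f s\n",
    $stats['file_size'], $stats['memory'], $stats['time']);
```

Passing the parser as a callback keeps the harness identical for every method, so only the parsing step differs between runs.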

Table of Results

This table summarises the raw results obtained running each method, providing links to details about each of them.

Method	File size	Memory	Time	Data size	Format
Method 0: Current behaviour (xmlize)	12.5MB	311.3MB	4.8 seconds	14MB	xmlize
Method 1: SimpleXML parsing + conversion to xmlize format	12.5MB	165.8MB	5.2 seconds	14MB	xmlize
Method 2: SimpleXML parsing, no conversion	12.5MB	36.5MB	0.5 seconds	8.5MB	simplexml
Method 3: Custom SAX parser + conversion to xmlize format	12.5MB	158.3MB	47.5 seconds	14MB	xmlize
Method 4: Custom SAX parser + conversion to xmlize-reduced format	12.5MB	72MB	51.9 seconds	7.7MB	xmlize-reduced
Method 5: Custom SAX parser + conversion to custom (simple) format	12.5MB	64.5MB	15.1 seconds	5.4MB	custom-simple

The alternatives analysed

Each of the alternatives is analysed and compared here.

Method 0: Current behaviour (xmlize)

Summary: file size: 12.5MB, memory: 311.3MB, time: 4.8 seconds, data size: 14MB, data format: xmlize format

Method 1: SimpleXML parsing + conversion to xmlize format

Summary: file size: 12.5MB, memory: 165.8MB, time: 5.2 seconds, data size: 14MB, data format: xmlize format

Method 2: SimpleXML parsing, no conversion (use simplexml as final object)

Summary: file size: 12.5MB, memory: 36.5MB, time: 0.5 seconds, data size: 8.5MB, data format: simplexml format
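A minimal sketch of what using the SimpleXML object directly might look like. The XML fragment and its tag names (MOD, ATTEMPTS, ATTEMPT, STATES, STATE, ANSWER) are hypothetical and only mimic the attempt/state nesting mentioned above; they are not taken from a real moodle.xml.

```php
<?php
// Hypothetical miniature of a quiz fragment; tag names are illustrative only.
$xml = <<<XML
<MOD>
  <ATTEMPTS>
    <ATTEMPT>
      <ID>1</ID>
      <STATES>
        <STATE><ID>10</ID><ANSWER>a</ANSWER></STATE>
        <STATE><ID>11</ID><ANSWER>b</ANSWER></STATE>
      </STATES>
    </ATTEMPT>
  </ATTEMPTS>
</MOD>
XML;

// Parse straight into a SimpleXMLElement; no conversion to another format.
$mod = simplexml_load_string($xml);
if ($mod === false) {
    throw new RuntimeException('Invalid XML');
}

// Walk the object: each attempt is available before its states are visited.
$answers = array();
foreach ($mod->ATTEMPTS->ATTEMPT as $attempt) {
    foreach ($attempt->STATES->STATE as $state) {
        // Values are SimpleXMLElement objects, so cast them before storing.
        $answers[(int) $state->ID] = (string) $state->ANSWER;
    }
}
print_r($answers);
```

Because the whole tree is in memory, the restore code can create the attempt first and then its states, which is the ordering concern raised in the discussion above.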

Method 3: Custom SAX parser + conversion to xmlize format

Summary: file size: 12.5MB, memory: 158.3MB, time: 47.5 seconds, data size: 14MB, data format: xmlize format

Method 4: Custom SAX parser + conversion to xmlize-reduced format

Summary: file size: 12.5MB, memory: 72MB, time: 51.9 seconds, data size: 7.7MB, data format: xmlize-reduced format

Method 5: Custom SAX parser + conversion to custom (simple) format

Summary: file size: 12.5MB, memory: 64.5MB, time: 15.1 seconds, data size: 5.4MB, data format: custom-simple format
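For readers unfamiliar with the SAX approach, the following is a minimal sketch of a handler-based parser built on PHP's XML Parser extension. The class name simple_sax_parser and the nested-array layout it produces are assumptions for illustration only: this page does not define the custom-simple format, and this is not the parser that was benchmarked.

```php
<?php
// Illustrative SAX-style parser. The output layout (tag name => list of
// values) is an assumption, not the "custom-simple" format itself.
class simple_sax_parser {
    private $stack = array(); // elements currently open
    private $text = '';       // character data of the innermost element
    private $result = null;

    public function parse($xml) {
        $parser = xml_parser_create('UTF-8');
        xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
        xml_set_element_handler($parser, array($this, 'start_tag'), array($this, 'end_tag'));
        xml_set_character_data_handler($parser, array($this, 'characters'));
        if (!xml_parse($parser, $xml, true)) {
            throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
        }
        return $this->result;
    }

    public function start_tag($parser, $name, $attrs) {
        $this->stack[] = array('name' => $name, 'children' => array());
        $this->text = '';
    }

    public function characters($parser, $data) {
        $this->text .= $data;
    }

    public function end_tag($parser, $name) {
        $node = array_pop($this->stack);
        // Leaf elements keep their text; other elements keep their children.
        $value = $node['children'] ? $node['children'] : trim($this->text);
        $this->text = '';
        if (!$this->stack) {
            $this->result = array($name => $value);
        } else {
            // An element is attached to its parent only once it has closed.
            $this->stack[count($this->stack) - 1]['children'][$name][] = $value;
        }
    }
}

$parser = new simple_sax_parser();
$data = $parser->parse('<ATTEMPT><ID>1</ID><STATES><STATE><ID>10</ID></STATE>'
    . '<STATE><ID>11</ID></STATE></STATES></ATTEMPT>');
print_r($data);
```

Note that the ATTEMPT element is only complete after all of its STATE children have closed, which illustrates why a pure-SAX restore would see the states before the attempt is finished. xml_parse() also accepts the input in successive chunks, so a real implementation could read the file piece by piece instead of loading it whole.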

Formats

In this section there are some explanations about the different object formats generated by the different methods:

XMLize format

SimpleXML format

XMLize-reduced format

Custom-simple format

Code

Here is the code used for the different parsing methods discussed above:

Method 0: Current behaviour

// Read the whole file into memory, then build the xmlize array from it.
$contents = file_get_contents('moodle.xml');
$data = xmlize($contents);

