Student projects/Global search
Note: You are currently viewing documentation for Moodle 1.9. Up-to-date documentation for the latest stable version is available here: Student projects/Global search.
Current Status: Zend Lucene updated to 1.0.1, revision 1042.
The Zend Framework includes a PHP 5 port of the Apache Lucene project, and I am currently using this to implement searching in Moodle. The main issues at the moment are index quality and document updating/removing. Adding a single document when it is created reduces the index quality, and this can only be corrected by an index optimiser, which doesn't seem to have been ported to PHP yet. One solution would be to keep a list of new/updated documents and run the indexing script when the list reaches x entries (for example, x = 100) or when the indexer hasn't been run for more than a day. The optimiser can then be scheduled to run every week or so.
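The batching rule above can be sketched as a small decision function. This is only an illustration in Python, not Moodle code: the function name, the constants and the choice to skip the run when nothing is pending are all assumptions; only the threshold of 100 and the one-day limit come from the text.

```python
from datetime import datetime, timedelta

# Values suggested in the text: x = 100 documents, "more than a day".
PENDING_THRESHOLD = 100
MAX_INDEX_AGE = timedelta(days=1)


def should_run_indexer(pending_count, last_run, now,
                       threshold=PENDING_THRESHOLD, max_age=MAX_INDEX_AGE):
    """Return True when the batch indexer should run.

    pending_count: number of new/updated documents waiting to be indexed.
    last_run, now: datetime objects.
    """
    if pending_count == 0:
        # Assumption: with nothing queued there is no point in running.
        return False
    if pending_count >= threshold:
        return True
    # Otherwise run only if the indexer has been idle for too long.
    return now - last_run > max_age
```

A cron script could call this on every tick and start the indexer only when it returns True, leaving the weekly optimiser run as a separate job.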
Speed/memory will be an inherent problem with PHP, and even if we develop our own system "in-house", I doubt we'll do any better than the Zend Framework team have done so far. That said, the performance of the current code is pretty acceptable; see below.
Maybe we can implement a very basic temporary search system (using SELECT ... LIKE ...) that only checks documents not yet in the main index, for the case where the indexer only runs every few days. When someone searches, the search page can then say: "These documents might match your query: 1. 2., but they're not yet in the primary index."
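A minimal sketch of such a fallback search, written in Python against SQLite purely for illustration. The table name `pending_documents` and its columns are hypothetical; the text only says that un-indexed documents would be matched with SELECT ... LIKE ....

```python
import sqlite3


def search_pending(conn, query):
    """Return (id, title) rows of un-indexed documents containing `query`.

    Assumes a hypothetical table pending_documents(id, title, content).
    """
    # Escape LIKE wildcards so the user's text is matched literally.
    escaped = (query.replace("\\", "\\\\")
                    .replace("%", "\\%")
                    .replace("_", "\\_"))
    pattern = f"%{escaped}%"
    cur = conn.execute(
        "SELECT id, title FROM pending_documents "
        "WHERE title LIKE ? ESCAPE '\\' OR content LIKE ? ESCAPE '\\' "
        "ORDER BY id",
        (pattern, pattern),
    )
    return cur.fetchall()


def make_demo_db():
    """Build a small in-memory database for trying the query out."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pending_documents "
        "(id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO pending_documents (title, content) VALUES (?, ?)",
        [
            ("CSS tips", "How to style themes"),
            ("Forum help", "Use css classes for posts"),
            ("Quiz setup", "100% coverage"),
        ],
    )
    return conn
```

A LIKE scan does no ranking or stemming, so it only makes sense while the pending list stays small; results from it would be shown separately from the main index results.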
Just a quick update to get some better rough numbers out: It takes me 168.250 seconds to index all 5734 wiki pages, and 0.588 seconds to return 418 results for "css". This is on a 2.083 GHz AMD with 1.5 GB RAM, and Apache2 with PHP 5.1.4. The index files take up 2304 KB on disk.
And 57 seconds to index the 1000-odd latest versions of the wiki pages, resulting in 1200 KB of index files.
[ Valery Fremaux ] The Lucene implementation was updated to 1.0.1 rev 1042. The update fixes major UTF-8 problems in the search engine that prevented any use beyond 1.7.
A big bullet-point in this project is restricting search results to the correct permission group (a normal user will only get results that he/she can actually view, i.e. no admin pages, etc.). I've started working on this: each result document object has the appropriate user/group/courseid stored with it, so hopefully that will be enough to calculate whether or not it should be shown.
[ Valery Fremaux ] Code for trapping disallowed results has been completed for version 1.8 of Moodle. Additional Zend fields have been added to check the exact context of the resource against users (that is, the context a user is actually checked against when accessing it). The search engine records any access-related context that may be needed by the specific capability check of each document search extension. Generic capabilities are resolved within the main result processing algorithm, while very specific capability checks are provided by the module-dependent search extension.
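The two-stage filtering described here (a generic check in the main result loop, then an optional module-specific check) could look roughly like the following. This is a Python sketch of the idea only: the `Result` and `User` classes, the field names, the convention that group 0 means "no group restriction", and the check registry are all hypothetical and do not reflect Moodle's actual capability API.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    title: str
    doctype: str    # module the document came from, e.g. "wiki"
    course_id: int
    group_id: int   # assumption: 0 means "not restricted to a group"


@dataclass
class User:
    course_ids: set = field(default_factory=set)
    group_ids: set = field(default_factory=set)


def generic_check(user, result):
    """Generic rule: the user must be in the course and, if set, the group."""
    if result.course_id not in user.course_ids:
        return False
    return result.group_id == 0 or result.group_id in user.group_ids


def filter_results(user, results, module_checks):
    """Keep results passing the generic check and any module-specific check.

    module_checks maps a doctype to a function (user, result) -> bool.
    """
    visible = []
    for result in results:
        if not generic_check(user, result):
            continue
        specific = module_checks.get(result.doctype)
        if specific is not None and not specific(user, result):
            continue
        visible.append(result)
    return visible
```

Keeping the module-specific checks in a lookup table means a new document type only has to register its own function; the main loop stays unchanged.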
The database only contains one search-related table at the moment, search_documents. This is just a summary of every document indexed, and contains the document's title, url, last-update-date and user/group/courseid. The table is synchronised with the index on disk (flat-file) using its primary key (DB->search_document->id == index->document->dbid).
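Because each index document carries the table's primary key as its dbid, checking that the two stores agree reduces to comparing two sets of ids. A small Python sketch of that comparison follows; the function name and the shape of its return value are assumptions made for illustration.

```python
def diff_table_and_index(db_ids, index_dbids):
    """Compare search_documents ids with the dbid values stored in the index.

    Returns a dict with two sorted lists:
      missing_from_index: rows in the table with no index document
      orphaned_in_index:  index documents whose table row is gone
    """
    db_ids = set(db_ids)
    index_dbids = set(index_dbids)
    return {
        "missing_from_index": sorted(db_ids - index_dbids),
        "orphaned_in_index": sorted(index_dbids - db_ids),
    }
```

Rows reported as missing would be queued for indexing, and orphaned index entries would be deleted from the index.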
These are some settings that needed to be implemented in an admin page.
- Block Settings
  - Block text
- Search Settings
  - Default number of results per page
  - How many documents before re-index
  - Record result popularity [on/off]
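As a rough illustration, the settings listed above could be grouped into one object with defaults and basic validation. The class and field names are hypothetical, and the defaults are only the values floated elsewhere on this page (25 results per page, re-index after 100 documents), not decided values.

```python
from dataclasses import dataclass


@dataclass
class SearchSettings:
    # Block settings
    block_text: str = "Search"
    # Search settings
    results_per_page: int = 25       # suggested first-page size
    reindex_threshold: int = 100     # documents queued before re-index
    record_popularity: bool = False  # on/off switch

    def __post_init__(self):
        # Reject values that would make paging or batching meaningless.
        if self.results_per_page < 1:
            raise ValueError("results_per_page must be at least 1")
        if self.reindex_threshold < 1:
            raise ValueError("reindex_threshold must be at least 1")
```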
Updating the index is an issue at the moment, because there is no index optimiser like Luke implemented in PHP. ZSL 0.14 introduced methods for updating/deleting records, but adding new content one page at a time will really bog down the index files, so hopefully the optimiser will be implemented in ZSL soon.
Lazy or eager updates? Lazy is conducive to a regular cron update script; eager is better for small cases (single page update), but probably not so good for re-indexing a restored backup? Must discuss this.
This should include at least all of the options available here: .
The interface is important from the user's perspective. Apparently the average query length from an average user is 1.3 words (and then they press enter and wait for the results). It is important to make the most basic case (1.3 words + submit) follow the best (read: most predictable) course of action and return the best results. Additional search options should be listed below the search box, all optional. Show more than the usual meagre 10 results on the first page: 25, say (customisable at least).
Moodle is going Unicode eventually; is the search going to cope with the transition? Problems include diacritic translation (u-umlaut -> 'ue', accented e -> e, etc.). Does the search explicitly know which language it's indexing/querying? Some languages don't have obvious word boundaries (unlike English, which uses spaces).
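The diacritic problem can be made concrete with a short Python sketch using the standard unicodedata module. The function name and the `german` flag are assumptions; the point is that stripping accents is language-neutral, while the 'ue' style transliteration only applies if the search knows it is handling German text.

```python
import unicodedata

# German-specific transliteration, applied only when the language is known.
GERMAN_TRANSLITERATION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def fold_diacritics(text, german=False):
    """Return text with accents removed, for use in indexing and querying."""
    # Compose first so "u" + combining diaeresis is seen as one character.
    text = unicodedata.normalize("NFC", text)
    if german:
        text = "".join(GERMAN_TRANSLITERATION.get(ch, ch) for ch in text)
    # Decompose, then drop the combining marks (the accents themselves).
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The same folding has to be applied to both the indexed text and the query, otherwise a search for "Mueller" would never meet a document containing "Müller".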
Metadata is a hassle to obtain - one option is to make authors include a few metadata tags when they create new content. "Free" metadata includes timestamps, author name, document type, size. Could use server logs to generate document popularity ratings to create a "Popular" meta field (like wiki_pages table has a 'hits' column).
Physical File Support
[ Valery Fremaux ] New physical file support has been developed for the 1.8 update of the global search. This implementation supports the major well-known document formats through open-source document-to-text converters.
Supported formats are:
- MSWord/OpenOffice (DOC) format
- MSPowerPoint (PPT) format
- Adobe (PDF) format
- XML format (based on entity CDATA content)
- HTML format (based on HTML tags content)
- fulltext (TXT) format
Only resources of type FILE are indexed. Support for other attachments is not yet provided.
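One way to organise per-format converters is a lookup table keyed on file extension. The Python sketch below is an assumption about structure, not the actual 1.8 code: it covers only TXT and HTML using the standard library, whereas DOC, PPT and PDF would call the external converters mentioned above. All function names are hypothetical.

```python
import os
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(raw):
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Join the fragments and collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


def txt_to_text(raw):
    return " ".join(raw.split())


# Extension -> converter. DOC/PPT/PDF entries would wrap external tools.
CONVERTERS = {
    ".txt": txt_to_text,
    ".html": html_to_text,
    ".htm": html_to_text,
}


def extract_text(filename, raw):
    """Return indexable text for a file, or None if the format is unsupported."""
    ext = os.path.splitext(filename)[1].lower()
    converter = CONVERTERS.get(ext)
    if converter is None:
        return None
    return converter(raw)
```

Returning None for unknown extensions lets the indexer skip unsupported files without failing. This simple HTML extractor also keeps the contents of script and style elements, which a real converter would likely want to drop.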