Student projects/Global search: Difference between revisions

Latest revision as of 19:57, 7 September 2007

Current Status: Zend Lucene updated 1.0.1, Revision 1042.

Zend Lucene

The Zend Framework includes a PHP 5 port of the Apache Lucene project, and I am currently using it to implement searching in Moodle. The main issues at the moment are index quality and document updating/removing. Adding a single document when it's created reduces the index quality, and this can only be corrected by an index optimiser, which doesn't seem to have been ported to PHP yet. One solution would be to keep a list of new/updated documents and run the indexing script when the list reaches x entries (e.g. x = 100) or when the indexer hasn't been run for more than a day. The optimiser can then be scheduled to run every week or so.
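The batching rule above can be sketched roughly as follows. This is only an illustration in Python, not Moodle or Zend code: the class name `PendingQueue` and its methods are hypothetical, and it assumes the age rule should only fire when something is actually waiting in the list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PendingQueue:
    """Tracks documents waiting to be added to the main index (hypothetical)."""
    max_pending: int = 100                       # the "x" from the text
    max_age: timedelta = timedelta(days=1)       # "more than a day"
    last_run: datetime = field(default_factory=datetime.now)
    pending: set = field(default_factory=set)

    def mark_changed(self, doc_id: int) -> None:
        # A set means repeated edits to one document are queued only once.
        self.pending.add(doc_id)

    def should_run(self, now: datetime) -> bool:
        # Run when the list is full, or when queued changes have waited too long.
        if len(self.pending) >= self.max_pending:
            return True
        # Assumption: an empty list never triggers a run, however old last_run is.
        return bool(self.pending) and now - self.last_run > self.max_age

    def drain(self, now: datetime) -> list:
        # Hand the batch to the indexer and reset the clock.
        batch = sorted(self.pending)
        self.pending.clear()
        self.last_run = now
        return batch
```

A cron script would call `should_run()` on each tick and pass the result of `drain()` to the indexer; the weekly optimiser run stays a separate job.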

Speed/memory will be an inherent problem with PHP, and even if we develop our own system "in-house", I doubt we'll do any better than the Zend Framework team have done so far. That said, the performance of the current code is pretty acceptable; see below.

Maybe we can implement a very basic temporary search system that only checks documents not yet in the main index, for the case where the indexer only runs every few days. When someone searches, the search page can then say: "These documents might match your query: 1. 2., but they're not yet in the primary index." (using SELECT ... LIKE ...)
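A minimal sketch of such a SELECT ... LIKE fallback, using Python's built-in sqlite3 purely for illustration. The table `pending_documents` and its columns are assumptions, not part of the Moodle schema; the wildcard escaping is there so that a user typing `%` or `_` is matched literally.

```python
import sqlite3


def escape_like(term: str) -> str:
    # Escape LIKE wildcards so user input is matched literally.
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def search_pending(conn: sqlite3.Connection, term: str) -> list:
    # Only rows not yet in the main index are scanned, so the table stays small.
    pattern = f"%{escape_like(term)}%"
    rows = conn.execute(
        "SELECT title FROM pending_documents "
        "WHERE title LIKE ? ESCAPE '\\' OR content LIKE ? ESCAPE '\\' "
        "ORDER BY title",
        (pattern, pattern),
    )
    return [title for (title,) in rows]


# Hypothetical table holding documents that are waiting for the indexer.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pending_documents (id INTEGER PRIMARY KEY, title TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO pending_documents (title, content) VALUES (?, ?)",
    [("CSS basics", "Styling pages"), ("Themes", "Uses css heavily"), ("Quiz", "Grading")],
)
print(search_pending(conn, "css"))
```

A leading-wildcard LIKE cannot use an ordinary index, so this only makes sense while the number of unindexed documents is small.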

Lucene Performance

Just a quick update to get some better rough numbers out: It takes me 168.250 seconds to index all 5734 wiki pages, and 0.588 seconds to return 418 results for "css". This is on a 2.083 GHz AMD with 1.5 GB RAM, and Apache2 with PHP 5.1.4. The index files take up 2304 KB on disk.

It takes 57 seconds to index the 1000-odd latest versions of the wiki pages, resulting in 1200 KB of index files.
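For a rough sense of scale, the full-wiki figures above work out to about 34 pages indexed per second and about 0.4 KB of index per page. The short calculation below just restates that arithmetic; the variable names are illustrative, and the "1000-odd" run is left out because its page count is not exact.

```python
pages = 5734            # wiki pages indexed
index_seconds = 168.250 # total indexing time
index_kb = 2304         # index size on disk

pages_per_second = pages / index_seconds   # indexing throughput
kb_per_page = index_kb / pages             # average index size per page

print(f"{pages_per_second:.1f} pages/s, {kb_per_page:.2f} KB of index per page")
```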

[ Valery Fremaux ] The Lucene implementation was updated to 1.0.1 rev 1042. The update fixes major UTF-8 problems in the search engine that prevented any use beyond 1.7.

Permissions

A big bullet-point in this project is restricting search results to the correct permission group (a normal user will only get results that he/she can actually view, i.e. no admin pages, etc.). I've started working on this: each result document object has the appropriate user/group/courseid stored with it, so hopefully that'll work to calculate whether or not it should be shown.

[ Valery Fremaux ] Code for trapping disallowed results has been completed for version 1.8 of Moodle. Additional Zend fields have been added to check the exact context of the resource against users (that is, which context is actually enforced against users when they access it). The search engine records any access-related context that may be needed by the specific capability check of each document search extension. Generic capabilities are resolved within the main result-processing algorithm, while very specific capability checks are provided by the module-dependent search extension.
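The basic idea of filtering hits by the user/group/courseid stored with each document can be sketched as below. This is a simplified Python illustration only: `Hit`, `Viewer` and `can_view` are hypothetical names, and the real Moodle 1.8 code resolves capabilities and contexts rather than comparing plain ids.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    title: str
    course_id: int
    group_id: Optional[int] = None   # None: not restricted to a group
    user_id: Optional[int] = None    # None: not private to one user


@dataclass(frozen=True)
class Viewer:
    user_id: int
    course_ids: frozenset
    group_ids: frozenset


def can_view(viewer: Viewer, hit: Hit) -> bool:
    # Every restriction stored on the hit must be satisfied by the viewer.
    if hit.course_id not in viewer.course_ids:
        return False
    if hit.group_id is not None and hit.group_id not in viewer.group_ids:
        return False
    return hit.user_id is None or hit.user_id == viewer.user_id


def visible_hits(viewer: Viewer, hits: list) -> list:
    # Filter after the index lookup, before the results page is rendered.
    return [hit for hit in hits if can_view(viewer, hit)]
```

Filtering after the lookup means the number of hits shown can be lower than the number the index returned, which the results page and pagination have to account for.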

Database

The database only contains one search-related table at the moment, search_documents. This is just a summary of every document indexed, and contains the document's title, url, last-update-date and user/group/courseid. The table is synchronised with the index on disk (flat-file) using its primary key (DB->search_document->id == index->document->dbid).
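The primary-key link between the table and the index could look roughly like this. It is a sketch in Python with sqlite3, not the Moodle schema: the column names (`updated`, `userid`, `groupid`, `courseid`) and both helper functions are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_documents ("
    " id INTEGER PRIMARY KEY, title TEXT, url TEXT, updated INTEGER,"
    " userid INTEGER, groupid INTEGER, courseid INTEGER)"
)


def register_document(title, url, updated, userid, groupid, courseid):
    # Insert the summary row first; its id is what the index stores as dbid.
    cur = conn.execute(
        "INSERT INTO search_documents (title, url, updated, userid, groupid, courseid)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        (title, url, updated, userid, groupid, courseid),
    )
    return cur.lastrowid


def rows_for_hits(dbids):
    # Resolve index hits back to their summary rows, preserving hit order.
    rows = []
    for dbid in dbids:
        row = conn.execute(
            "SELECT id, title, url FROM search_documents WHERE id = ?", (dbid,)
        ).fetchone()
        if row is not None:   # a missing row means index and table are out of sync
            rows.append(row)
    return rows
```

Hits whose dbid has no matching row are silently skipped here; a real implementation would probably also queue them for removal from the index.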

Admin page

These are some settings that need to be implemented in an admin page.

Block Settings
1. Block text

Search Settings
1. Default number of results per page
2. How many documents before re-index
3. Record result popularity [on/off]

Updating content

This is an issue at the moment, because no index optimiser like Luke has been implemented in PHP. ZSL 0.14 has introduced methods for updating/deleting records, but adding new content one page at a time will really bog down the index files, so hopefully the optimiser will be implemented in ZSL soon.

Lazy or eager updates? Lazy is conducive to a regular cron update script; eager is better for small cases (a single page update), but probably not so good for re-indexing a restored backup. Must discuss this.
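The difference between the two modes can be sketched with a toy in-memory index. This is a Python illustration only, not ZSL code: `TinyIndex` and `Updater` are hypothetical, and the sketch assumes an update is carried out as a delete followed by an add.

```python
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.terms_by_doc = {}

    def delete(self, doc_id):
        for term in self.terms_by_doc.pop(doc_id, ()):
            self.postings[term].discard(doc_id)

    def add(self, doc_id, text):
        terms = set(text.lower().split())
        self.terms_by_doc[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def update(self, doc_id, text):
        # Assumption: an update is a delete followed by an add.
        self.delete(doc_id)
        self.add(doc_id, text)

    def find(self, term):
        return sorted(self.postings.get(term.lower(), ()))


class Updater:
    def __init__(self, index, eager):
        self.index = index
        self.eager = eager
        self.queue = {}          # doc_id -> latest text, used in lazy mode

    def on_save(self, doc_id, text):
        if self.eager:
            self.index.update(doc_id, text)   # searchable immediately
        else:
            self.queue[doc_id] = text         # searchable after the next cron run

    def cron(self):
        # Lazy mode: apply only the latest version of each queued document.
        for doc_id, text in self.queue.items():
            self.index.update(doc_id, text)
        self.queue.clear()
```

In lazy mode, several saves of the same page between cron runs cost one index update instead of many, which is the case that matters when a restored backup touches every document at once.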

Query

This should include at least all of the options available here: [1].

The interface is important from the user's perspective. Apparently the average query from an average user is 1.3 words long (after which they press enter and wait for the results). It is important that the most basic case (1.3 words + submit) follows the best (read: predictable) course of action and returns the best results. Additional search options are listed below the search box, all optional. Show more than the usual, rather meagre, 10 results on the first page: 25, say, or at least make the number customisable.

Moodle is going Unicode eventually; is the search going to cope with the transition? Problems include diacritic translation (umlaut ü -> 'ue', accented é -> 'e', etc.). Does the search explicitly know which language it's indexing/querying? Some languages don't have obvious word boundaries (unlike English, which separates words with spaces).
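One way to approach the diacritic problem is to fold text to a plain form at both index time and query time. The sketch below uses Python's standard `unicodedata` module for illustration; the function name `fold`, the `german` flag and the mapping table are assumptions, and it shows why the search needs to know the language: "ü" becomes "ue" under the German convention but "u" under generic accent stripping.

```python
import unicodedata

# German convention mentioned in the text: umlauts expand to a vowel plus "e".
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def fold(text: str, german: bool = False) -> str:
    # Compose first so that "u" + combining diaeresis is seen as a single "ü".
    text = unicodedata.normalize("NFC", text)
    if german:
        text = "".join(GERMAN.get(ch, ch) for ch in text)
    # NFKD splits "é" into "e" plus a combining accent; drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold("café"), fold("Müller"), fold("Müller", german=True))
```

The same function has to be applied to the query string as to the indexed text, otherwise folded terms in the index will never match unfolded terms typed by the user.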

Metadata

Metadata is a hassle to obtain; one option is to make authors include a few metadata tags when they create new content. "Free" metadata includes timestamps, author name, document type and size. We could use server logs to generate document popularity ratings for a "Popular" meta field (much as the wiki_pages table has a 'hits' column).
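Deriving a popularity count from server logs might look like the sketch below. It is a Python illustration that assumes Apache-style access log lines and counts only successful GET requests; the regular expression, the function name `popularity` and the sample lines are all made up for the example.

```python
import re
from collections import Counter

# Assumption: Apache common/combined log format, successful GET requests only.
REQUEST = re.compile(r'"GET (\S+) HTTP/[\d.]+" 200 ')


def popularity(log_lines):
    hits = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match:
            hits[match.group(1)] += 1   # key is the requested path
    return hits


sample = [
    '10.0.0.1 - - [07/Sep/2007:19:57:00 +0000] "GET /mod/wiki/view.php?id=3 HTTP/1.1" 200 5120',
    '10.0.0.2 - - [07/Sep/2007:19:58:10 +0000] "GET /mod/wiki/view.php?id=3 HTTP/1.1" 200 5120',
    '10.0.0.2 - - [07/Sep/2007:19:58:30 +0000] "GET /mod/wiki/view.php?id=9 HTTP/1.1" 404 312',
]
print(popularity(sample).most_common(1))
```

The resulting counts would still need to be mapped from URLs back to document ids before they could be stored as a meta field.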

Physical File Support

[ Valery Fremaux ] Physical file support has been developed for the 1.8 update of the global search. This implementation supports the major well-known document formats through open-source document-to-text converters.

Supported formats are:

MSWord/OpenOffice (DOC) format
MSPowerPoint (PPT) format
Adobe (PDF) format
XML format (based on entity CDATA content)
HTML format (based on HTML tags content)
fulltext (TXT) format

Only resources of the FILE type are indexed. Support for other attachments is not yet provided.
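The general shape of per-format text extraction is a lookup from file extension to converter. The sketch below is a Python illustration using only the standard library, so it covers just the HTML and TXT cases; all names are hypothetical, and the DOC, PPT and PDF formats would need the external converters mentioned above, which are not shown.

```python
from html.parser import HTMLParser
from pathlib import PurePath


class _TextExtractor(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(raw: bytes) -> str:
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.chunks)


def txt_to_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


# Extension -> converter; formats without an entry are simply not indexed.
CONVERTERS = {".html": html_to_text, ".htm": html_to_text, ".txt": txt_to_text}


def extract_text(filename: str, raw: bytes):
    converter = CONVERTERS.get(PurePath(filename).suffix.lower())
    return converter(raw) if converter else None
```

Returning `None` for an unknown extension lets the indexer skip the file rather than index raw binary content.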

@@ Line 1: / Line 1: @@
-'''Current Status''': Research and planning, investigating Zend Lucene.
+'''Current Status''': Zend Lucene updated 1.0.1, Revision 1042.
 == Zend Lucene ==
-It looks like we'll be going forward with a Lucene implementation; hopefully looking at a prototype version by the end of next week. Issues at the moment include speed/memory, and index quality. Adding a single document when it's created, or updated reduces the index quality, and this can only be corrected by an index optimiser, which doesn't seem to have been ported to PHP
+The Zend Framework has a port of the Apache Lucene project in PHP 5, and I am currently using this to implement searching in Moodle. Issues at the moment are mainly index quality and document updating/removing. Adding a single document when it's created reduces the index quality, and this can only be corrected by an index optimiser, which doesn't seem to have been ported to PHP yet. One solution would be to have a list of new/updated documents, and when it reaches x (x = 100 for e.g.) or the indexer hasn't been run for more than a day, the script runs. The optimiser can then be scheduled to run every week or so.
-yet. Even with an index optimiser, an update for a single document would work like this: delete current doc index, add new doc to index, run optimiser. So the better solution would be to have a list of updated documents, and when it reaches x (x = 100 for e.g.) or the indexer hasn't been run for more than a day, the script runs. The optimiser can then be scheduled to run every week or so. Speed/memory will be an inherent problem with PHP, and even if we develop our own system "in-house", I doubt we'll do any better than the Zend Framework team have done so far.
-Maybe we can implement a very basic temporary search system that only checks against documents not yet in the main index - so when someone searches the search page can say: "These documents might match your query: 1. 2., but they're not yet in the primary index." (using SELECT ... LIKE ...)
+Speed/memory will be an inherent problem with PHP, and even if we develop our own system "in-house", I doubt we'll do any better than the Zend Framework team have done so far - that said the performance of the current code is pretty acceptable, see below.
-== Lucene Performance ==
+Maybe we can implement a very basic temporary search system that only checks against documents not yet in the main index - in the case of the indexer only running every few days - so when someone searches the search page can say: "These documents might match your query: 1. 2., but they're not yet in the primary index." (using SELECT ... LIKE ...)
-Just a quick update to get some rough (and possibly incorrect) numbers out:
-It takes me 164 seconds to index 3588 wiki pages, and 0.224 seconds to return results for "css".
-This is on a 2.083GHz AMD with 1.5GB RAM, and Apache2 with PHP 5.1.4. The index files take up 900kb on disk.
-This really is pretty rough at the moment, some wiki pages didn't run through the indexer, I'm going to try figure out why now,
+=== Lucene Performance ===
-and the results seem a bit skewed, so I'm not actually sure if everything is cool.
+Just a quick update to get some better rough numbers out:
+It takes me 168.250 seconds to index all 5734 wiki pages, and 0.588 seconds to return 418 results for "css".
+This is on a 2.083GHz AMD with 1.5GB RAM, and Apache2 with PHP 5.1.4. The index files take up 2304kb on disk.
+And 57 seconds to index the 1000-odd latest versions of the wiki pages, resulting in 1200kb of index files.
+[ [[User:vf|Valery Fremaux]] ] Lucene implementation was updated to 1.0.1 rev 1042. The update fixes major UTF8 problems the search engine had that avoided any use over 1.7.
+== Permissions ==
+A big bullet-point in this project is restricting search results to the correct permission group (a normal user will only get results that he/she can actually view, i.e. no admin pages, etc.). I've started working on this - Each result document object has the appropriate user/group/courseid stored with it, so hopefully that'll work to calculate whether or not it should be shown.
+[ [[User:vf|Valery Fremaux]] ] Code for trapping unallowed results has been completed for version 1.8 of Moodle. Additional Zend fields have been added to check the exact context of the resource against users (what context is actually opposed to users when they get it). The search engine records any access related context that may be needed by the specific capability check of each document search extension. Generic capabilities are resolved within the main result processing algorithm, while very specific capailities checks are providen by the module dependant search extension).
+== Database ==
+The database only contains one search related table at the moment, search_documents. This is just a summary of every document indexed, and contains the document's title, url, last-update-date and user/group/courseid. The table is synchronised with the index on disk (flat-file) using it's primary key (DB->search_document->id == index->document->dbid).
 == Admin page ==
+These are some settings that needed to be implemented in an admin page.
 # Block Settings
@@ Line 26: / Line 37: @@
 ## Record result popularity [on/off]
+== Updating content ==
+This is an issue at the moment, because there is no index optimiser like [http://www.getopt.org/luke Luke] that has been implemented in PHP. ZSL 0.14 has introduced updating/deleting record methods, but adding new content 1 page at a time will really bog down the index files - so hopefully the optimiser will be implemented in ZSL soon.
-== Old Content ==
+Lazy- or eager-updates? Lazy is conducive to a regular cron update script, eager better for small cases (single page update); and eager probably not so good for re-indexing a restored backup? Must discuss this.
-'''Search engine'''
-# [[#crawler|Crawler]]
-## [[#content2|Content to be indexed]]
-## [[#indexing|Indexing]]
-## [[#storage|Storage of results]]
-## [[#updates|Updating content]]
-# [[#query|Query module]]
-## [[#input|Input module]]
-## [[#thequery|The query]]
-## [[#results|Results page]]
-'''Other topics'''
-# [[#api|API]]
-# [[#i18n|Internationalisation]]
-# [[#meta|Metadata]]
-# [[#rank|Ranking]]
-# [[#random|Extras]]
-== Implementation ==
-. '''<span id="crawler">Crawler</span>'''
-: The crawler will be an internal process that has access to the same structured data as the rest of the system (i.e. the database fields). This will improve crawling efficiency, and thus running time. Using internal data will also ensure that most of the content to be indexed will be standardised and quirk-free (no unclosed tags, etc.). A flagging/callback system will notify the crawler about last-update times and new content, enabling up-to-date searches.
-: 1.1 '''<span id="content2">Content to be indexed</span>'''
-:: Content to be indexed will be fields stored in the database, for example, the wiki module will expose a structure like '''(table_name|field1,field2)''', the crawler can then see that '''field1''' and '''field2''' are text-data located in '''table_name'''. Text in the moodle system usually has an associated format field which specifies if it is plain-text, HTML, mf, and so on. '''format_text()''' prepares text for final display, and renders all text into correct HTML format - we could use the output of this function as the final content to be indexed ('''strip_tags()''' to remove the HTML). This has the added advantage of searching the same text the user sees if he/she browses the site manually - in more detail, '''format_text()''' applies all the appropriate text filters, and one of them is the censor module, which removes certain words/phrases from output text, thus whilst a word may appear in the database a user may never get to see it because it is on the censor-list. So, in this case it would be better to index the text without the offending word (to prevent it from showing in the search result listing).
-:: HTML tags will be ignored for now - they introduce a lot of other issues that are not important at the moment.
-:: '''Structure'''
-:: [content parser] -> [indexer] ==> (results into database)
-:: Seperating the indexer into two modules (parser and indexer) allows different parsers for different situations to be dropped in - for e.g. an HTML parser for static-content pages (non-database). A standard for the text passed to the indexer should be created; perhaps have an additional step in the pipeline for removing stop-words, and stemming etc.
-: 1.2 '''<span id="indexing">Indexing</span>'''
-:: Efficiency/run time: this must be quick, investigate PHP optimisations. (Any chance of dropping down to a faster language if need be?). Like updates, is this eager- or lazy- indexing - i.e. index documents as they are submitted for the first time, or do batch indexing jobs at a certain time (e.g. cron <script>).
-:: '''Tables''':
-:: ('''documents'''|doc_id, title, parent_id, type, permissions)
-:: ('''words'''|id, word)
-:: ('''postings'''|doc_id, word_id)
-:: ('''popular'''|doc_id, [word_id's]) (?)
-:: Don't worry about the above yet, just putting some basic structures down for now.
-: 1.3 '''<span id="storage">Storage of results</span>'''
-:: A selling point of Moodle is it's cross-db compatibility - as such, all database calls go through AdoDB, and furthermore it follows that the search module should really try not to use any DBMS-specific features (e.g. MySQL full-text search). All of the db queries should be standard SQL. Personally, I don't think this is a problem as I'm imagining most of the queries will be basic insert/update/delete's and selects with joins. The schema(s) must be defined in XML.
-:: Ranking will be calculated in the module, using some formula and various bits of information selected from the database.
-: 1.4 '''<span id="updates">Updating content</span>'''
-:: Lazy- or eager-updates? Lazy is conducive to a regular cron update script, eager better for small cases (single page update); and eager probably not so good for re-indexing a restored backup? Must discuss this.
-. '''<span id="query">Query module</span>'''
-: Similar to indexer, will have an input parser to take the user's string and turn it into a standardised query/structure.
-: Important: only search pages/texts that the user has permission to see! In database, there will be a main listing of all searchable documents, with their title, type, URI, parent ID stored, etc. To this table add some form of permission token, a text identifier matching the current method the permission system uses. Can then check if '''student''' can '''forum_canreadposts''', for example - and then check the student against this particular forum, and so on. The system must be flexible enough to allow future Moodle versions to continue with new permission schemes.
-: Have fields for userid, groupid, courseid for each text/document.
-: 2.1 '''<span id="input">Input module</span>'''
-:: Boolean search grammer:
-::: '''+''' required keyword
-::: '''-''' don't want keyword
-::: '''~''' reduces rank (but can be present)
-::: '''<''' decrease word contribution compared to other words in query
-::: '''>''' increase word contribution to overall rank
-:: This should include at least all of the options available here: [http://moodle.org/mod/forum/search.php?id=5].
-: 2.2 '''<span id="thequery">The query</span>'''
-:: The interface is important from user perspective. Apparently the average query length from an average user is 1.3 words (and then they press enter and wait for the results). Important to make the most basic case (1.3 words + submit) follow the best (read predictable) course of action and return the best results. Additional options for searching listed below search box, all optional. Have more than the usual lacking 10 results on the first page, 25 (customisable at least).
-:: Search in user's/the specific Moodle's language...? see [[#i18n|Internationalisation]].
-: 2.3 '''<span id="results">Results page</span>'''
+== Query ==
-:: Show the first '''n''' words from a matching page (from the start), or try show the relevant matching text? - something only possible with phrase matching/proximity calculations, probably something to leave out for the time being. Another reason to make sure everything is modular, so features like this can be dropped in later!
+This should include at least all of the options available here: [http://moodle.org/mod/forum/search.php?id=5].
-:: Ranking: [[#rank|Ranking]]
+The interface is important from user perspective. Apparently the average query length from an average user is 1.3 words (and then they press enter and wait for the results). Important to make the most basic case (1.3 words + submit) follow the best (read predictable) course of action and return the best results. Additional options for searching listed below search box, all optional. Have more than the usual lacking 10 results on the first page, 25 (customisable at least).
-:: Highlight search keywords, pagination, search within results?
+Moodle is going Unicode eventually, is the search going to cope with the transition? Problems include diacritic translation (umlaut -> 'ue', e(accent) -> e, etc.). Does the search explicitly know which language it's indexing/querying? Some languages don't have obvious word boundaries (like english with spaces).
-----
+== Metadata ==
+Metadata is a hassle to obtain - one option is to make authors include a few metadata tags when they create new content. "Free" metadata includes timestamps, author name, document type, size. Could use server logs to generate document popularity ratings to create a "Popular" meta field (like wiki_pages table has a 'hits' column).
-. '''<span id="api">API</span>'''
-: It would be good to develop and possibly use a web-interface for this. Operate by passing XML via POST to the search module, which then performs whichever task was asked. The benefit of this is that it seperates the search module from the PHP language, allowing anything capable of sending a POST command to a web-server to interact with the software. I'm possibly trying to be to general here, since this is a specific application for Moodle, but the option could always be built-in, but it doesn't have to be used. So, pretty much a not-necessary feature for the basic project.
-. '''<span id="i18n">Internationalisation</span>'''
+== Physical File Support ==
-: Moodle is going Unicode eventually, is the search going to cope with the transition? Problems include diacritic translation (umlaut -> 'ue', e(accent) -> e, etc.). Does the search explicitly know which language it's indexing/querying? Some languages don't have obvious word boundaries (like english with spaces).
-. '''<span id="meta">Metadata</span>'''
+[ [[User:vf|Valery Fremaux]] ] A new physical file support has been developped for the 1.8 update of the global search. This implementation supports major well-known document formats, through open-source document "to text" converters.
-: Metadata is a hassle to obtain - one option is to make authors include a few metadata tags when they create new content. "Free" metadata includes timestamps, author name, document type, size. Could use server logs to generate document popularity ratings to create a "Popular" meta field (more below, in Ranking).
-. '''<span id="rank">Ranking</span>'''
+Supported formats are :
-: Beyond basic statistical word ranking, we can have additional fields in the database that measure a document's popularity (as a chosen URI to visit after a search), and the words used to find it. This would create data that allows things like "20 other users visited this document after searching for 'moo cows'."
+* MSWord/OpenOffice (DOC) format
+* MSPowerPoint (PPT) format
+* Adobe (PDF) format
+* XML format (based on entity CDATA content)
+* HTML format (based on HTML tags content)
+* fulltext (TXT) format
+Only ressource indexed as FILE type resources are indexed. Support for other attachements is not yet providen.
 == See also ==
 *[[Student projects]]
+*[http://moodle.org/mod/forum/discuss.php?d=48715 Forum topic]
 [[Category:Developer]]
 [[Category:Project]]
