Student projects/Global search: Difference between revisions

Revision as of 02:30, 2 June 2006

Implementing a global search solution in Moodle

Search engine

Crawler
Query module

Other topics

API
Internationalisation
Metadata
Ranking
Extras

Details

1. Crawler

The crawler will be an internal process that has access to the same structured data as the rest of the system (i.e. the database fields). This will improve crawling efficiency, and thus running time. Using internal data will also ensure that most of the content to be indexed will be standardised and quirk-free (no unclosed tags, etc.). A flagging/callback system will notify the crawler about last-update times and new content, enabling up-to-date searches.

1.1 Content to be indexed

The content to be indexed will be fields stored in the database. For example, the wiki module will expose a structure like (table_name|field1,field2); the crawler can then see that field1 and field2 are text data located in table_name. Text in the Moodle system usually has an associated format field which specifies whether it is plain text, HTML, mf, and so on. format_text() prepares text for final display and renders all text into correct HTML, so we could use the output of this function as the final content to be indexed (with strip_tags() to remove the HTML). This has the added advantage of searching the same text the user sees when browsing the site manually. In more detail, format_text() applies all the appropriate text filters, one of which is the censor module, which removes certain words/phrases from output text. A word may therefore appear in the database without the user ever seeing it, because it is on the censor list. In that case it is better to index the text without the offending word, to prevent it from showing in the search result listing.

HTML tags will be ignored for now - they introduce a lot of other issues that are not important at the moment.

Structure

[content parser] -> [indexer] ==> (results into database)

Separating the indexer into two modules (parser and indexer) allows different parsers for different situations to be dropped in, e.g. an HTML parser for static-content pages (non-database). A standard for the text passed to the indexer should be created; perhaps add a further step in the pipeline for removing stop-words, stemming, etc.
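The parser/indexer split above can be sketched roughly as follows. This is only an illustration in Python, not Moodle code: the function names, the stop-word list and the in-memory index are all assumptions, and the tag stripping merely approximates what strip_tags() would do on format_text() output. Stemming is left out.

```python
import re
from html.parser import HTMLParser

# Hypothetical stop-word list; a real deployment would load one per language.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_content(html):
    """Content parser: turn rendered HTML into plain text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


def normalise(text):
    """Extra pipeline step: lower-case, tokenise, drop stop-words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def index_document(doc_id, html, index):
    """Indexer: add the document's tokens to an in-memory inverted index."""
    for token in normalise(parse_content(html)):
        index.setdefault(token, set()).add(doc_id)
    return index


index = index_document(1, "<p>The <b>wiki</b> module is indexed</p>", {})
```

Because the indexer only ever sees the token list, a different parser (for static pages, say) could be swapped in without touching index_document().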

1.2 Indexing

Efficiency/run time: this must be quick, so investigate PHP optimisations. (Any chance of dropping down to a faster language if need be?) As with updates, should indexing be eager or lazy, i.e. index documents as they are submitted for the first time, or run batch indexing jobs at a certain time (e.g. cron <script>)?

Tables:

(documents|doc_id, title, parent_id, type, permissions)

(words|id, word)

(postings|doc_id, word_id)

(popular|doc_id, [word_id's]) (?)

Don't worry about the above yet, just putting some basic structures down for now.
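To make the draft tables a little more concrete, here is a minimal sketch using Python's sqlite3 module as a stand-in database. The column types, the keys and the helper functions are assumptions; the tentative popular table is left out, and the queries stick to plain inserts and selects with joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (
        doc_id      INTEGER PRIMARY KEY,
        title       TEXT,
        parent_id   INTEGER,
        type        TEXT,
        permissions TEXT
    );
    CREATE TABLE words (
        id   INTEGER PRIMARY KEY,
        word TEXT UNIQUE
    );
    CREATE TABLE postings (
        doc_id  INTEGER REFERENCES documents(doc_id),
        word_id INTEGER REFERENCES words(id),
        PRIMARY KEY (doc_id, word_id)
    );
""")


def word_id(conn, word):
    """Return the id of a word, inserting it first if it is new."""
    row = conn.execute("SELECT id FROM words WHERE word = ?", (word,)).fetchone()
    if row:
        return row[0]
    return conn.execute("INSERT INTO words (word) VALUES (?)", (word,)).lastrowid


def add_document(conn, doc_id, title, doc_type, tokens):
    """Store one document and one posting per distinct token."""
    conn.execute(
        "INSERT INTO documents (doc_id, title, type) VALUES (?, ?, ?)",
        (doc_id, title, doc_type),
    )
    for token in set(tokens):
        conn.execute(
            "INSERT INTO postings (doc_id, word_id) VALUES (?, ?)",
            (doc_id, word_id(conn, token)),
        )


def find_documents(conn, word):
    """Look up all documents containing a word via the postings table."""
    rows = conn.execute(
        "SELECT d.doc_id, d.title FROM documents d "
        "JOIN postings p ON p.doc_id = d.doc_id "
        "JOIN words w ON w.id = p.word_id "
        "WHERE w.word = ? ORDER BY d.doc_id",
        (word,),
    )
    return rows.fetchall()


add_document(conn, 1, "Wiki intro", "wiki", ["moodle", "wiki"])
add_document(conn, 2, "Forum post", "forum", ["moodle", "forum"])
```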

1.3 Storage of results

A selling point of Moodle is its cross-db compatibility. All database calls go through ADOdb, so the search module should avoid DBMS-specific features (e.g. MySQL full-text search); all of the db queries should be standard SQL. Personally, I don't think this is a problem, as I imagine most of the queries will be basic inserts, updates and deletes, and selects with joins. The schema(s) must be defined in XML.

Ranking will be calculated in the module, using some formula and various bits of information selected from the database.

1.4 Updating content

Lazy or eager updates? Lazy is conducive to a regular cron update script; eager is better for small cases (a single page update), but probably not so good for re-indexing a restored backup. Must discuss this.
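The two options could sit behind one notification hook, so the choice stays open. The sketch below is an assumption-laden illustration: UpdateQueue, notify() and run_batch() are invented names, and index_fn stands for whatever function re-indexes a single document.

```python
class UpdateQueue:
    """Receives change notifications and indexes eagerly or lazily."""

    def __init__(self, index_fn, eager=False):
        self.index_fn = index_fn
        self.eager = eager
        self.pending = []  # doc ids awaiting a batch run, without duplicates

    def notify(self, doc_id):
        """Called whenever a document is created or changed."""
        if self.eager:
            self.index_fn(doc_id)          # index straight away
        elif doc_id not in self.pending:
            self.pending.append(doc_id)    # defer to the next batch run

    def run_batch(self):
        """Called from a scheduled job; returns how many documents it indexed."""
        done = 0
        while self.pending:
            self.index_fn(self.pending.pop(0))
            done += 1
        return done


indexed = []
lazy = UpdateQueue(indexed.append)
lazy.notify(7)
lazy.notify(7)   # a second edit before the batch run costs nothing extra
lazy.notify(9)
```

In lazy mode, repeated edits to one document collapse into a single re-index, which is one argument for batching a restored backup.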

2. Query module

Similar to the indexer, this will have an input parser that takes the user's string and turns it into a standardised query/structure.

Important: only search pages/texts that the user has permission to see! In the database, there will be a main listing of all searchable documents, with their title, type, URI, parent ID, etc. To this table, add some form of permission token: a text identifier matching the current method the permission system uses. We can then check whether a student can forum_canreadposts, for example, and then check the student against this particular forum, and so on. The system must be flexible enough to allow future Moodle versions to continue with new permission schemes.

Have fields for userid, groupid, courseid for each text/document.
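One way to read the permission-token idea is as a lookup from token to check function. The sketch below assumes exactly that; the checks themselves, the owner_only token and the dictionary shapes for users and documents are hypothetical, and only forum_canreadposts is a name taken from the text above.

```python
def can_read_forum(user, doc):
    """Hypothetical check: the user is enrolled in the document's course."""
    return doc["courseid"] in user["courses"]


def can_read_own(user, doc):
    """Hypothetical check: only the author may see the document."""
    return doc["userid"] == user["id"]


# Token stored with each document -> function that decides visibility.
PERMISSION_CHECKS = {
    "forum_canreadposts": can_read_forum,
    "owner_only": can_read_own,
}


def visible_results(user, docs):
    """Keep only documents whose permission check passes for this user."""
    visible = []
    for doc in docs:
        check = PERMISSION_CHECKS.get(doc["permissions"])
        # Unknown tokens are denied, so a new scheme must register a check first.
        if check is not None and check(user, doc):
            visible.append(doc)
    return visible


student = {"id": 42, "courses": {5}}
docs = [
    {"doc_id": 1, "permissions": "forum_canreadposts", "courseid": 5, "userid": 7},
    {"doc_id": 2, "permissions": "forum_canreadposts", "courseid": 6, "userid": 7},
    {"doc_id": 3, "permissions": "owner_only", "courseid": 6, "userid": 42},
    {"doc_id": 4, "permissions": "some_future_scheme", "courseid": 5, "userid": 42},
]
```

A future permission scheme would then only need a new entry in the registry, not a change to the query code.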

2.1 Input module

Boolean search grammar:

+ required keyword

- don't want keyword

~ reduces rank (but can be present)

< decrease word contribution compared to other words in query

> increase word contribution to overall rank

This should include at least all of the options available in the Moodle forum search: http://moodle.org/mod/forum/search.php?id=5.
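The operators listed above could be parsed into a standardised structure along these lines. The Term tuple and the numeric weights are illustrative assumptions; the notes do not fix how much each operator should change the rank.

```python
from collections import namedtuple

Term = namedtuple("Term", ["word", "required", "excluded", "weight"])

# operator -> (required, excluded, weight); the weights are placeholder values.
OPERATORS = {
    "+": (True, False, 1.0),    # keyword must be present
    "-": (False, True, 0.0),    # keyword must not be present
    "~": (False, False, -1.0),  # may be present, but reduces the rank
    "<": (False, False, 0.5),   # contributes less than other words
    ">": (False, False, 2.0),   # contributes more than other words
}


def parse_query(query):
    """Turn a raw query string into a list of Term tuples."""
    terms = []
    for token in query.split():
        if token[0] in OPERATORS:
            required, excluded, weight = OPERATORS[token[0]]
            word = token[1:]
        else:
            required, excluded, weight = False, False, 1.0
            word = token
        if word:  # ignore a bare operator with no keyword attached
            terms.append(Term(word.lower(), required, excluded, weight))
    return terms


terms = parse_query("+moodle -quiz >forum wiki")
```

A plain query with no operators passes straight through with default weights, which keeps the common one-or-two-word case simple.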

2.2 The query

The interface is important from the user's perspective. Apparently the average query length from an average user is 1.3 words (and then they press enter and wait for the results). It is important to make this most basic case (1.3 words + submit) follow the best (read: predictable) course of action and return the best results. Additional search options are listed below the search box, all optional. Show more than the usual, rather meagre, 10 results on the first page: 25 by default, and at least customisable.

Search in user's/the specific Moodle's language...? see Internationalisation.

2.3 Results page

Show the first n words from a matching page (from the start), or try to show the relevant matching text? The latter is only possible with phrase matching/proximity calculations, so it is probably something to leave out for the time being. Another reason to make sure everything is modular, so features like this can be dropped in later!

Ranking: see Ranking.

Highlight search keywords, pagination, search within results?
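The simpler results-page options (first n words, keyword highlighting, pagination) can be sketched as below. The function names are assumptions, the page size of 25 follows the suggestion in 2.2, and the highlighter does not escape HTML, which a real results page would need to do.

```python
import re


def snippet(text, n=25):
    """Return the first n words of a document, marking any truncation."""
    words = text.split()
    out = " ".join(words[:n])
    return out + (" ..." if len(words) > n else "")


def highlight(text, keywords):
    """Wrap whole-word, case-insensitive keyword matches in <b> tags."""
    if not keywords:
        return text
    alternatives = "|".join(re.escape(k) for k in keywords)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(r"<b>\1</b>", text)


def paginate(results, page, per_page=25):
    """Return the slice of results for a 1-based page number."""
    start = (page - 1) * per_page
    return results[start:start + per_page]
```

For example, highlight(snippet(text, 30), ["moodle"]) would produce a 30-word summary with the keyword emphasised.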

1. API

It would be good to develop and possibly use a web interface for this. It would operate by passing XML via POST to the search module, which then performs whichever task was asked. The benefit is that this separates the search module from the PHP language, allowing anything capable of sending a POST request to a web server to interact with the software. I'm possibly trying to be too general here, since this is a specific application for Moodle; the option could always be built in without having to be used. So, pretty much a non-essential feature for the basic project.
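As a rough idea of what such an exchange might look like, the sketch below handles one XML request body and builds an XML response. Everything about the message format (the search, query, limit, results and result elements) is invented for illustration, and the web server around it is omitted.

```python
import xml.etree.ElementTree as ET


def handle_request(body, search_fn):
    """Parse a hypothetical <search> request and return an XML result list.

    search_fn is assumed to take a query string and return (doc_id, title) pairs.
    """
    root = ET.fromstring(body)
    if root.tag != "search":
        raise ValueError("unsupported request: " + root.tag)
    query = root.findtext("query", default="")
    limit = int(root.findtext("limit", default="25"))

    response = ET.Element("results")
    for doc_id, title in search_fn(query)[:limit]:
        item = ET.SubElement(response, "result", {"id": str(doc_id)})
        item.text = title
    return ET.tostring(response, encoding="unicode")


def fake_search(query):
    """Stand-in for the real search module."""
    return [(1, "Cows"), (2, "More cows")]


reply = handle_request(
    "<search><query>moo cows</query><limit>1</limit></search>", fake_search
)
```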

2. Internationalisation

Moodle is going Unicode eventually; is the search going to cope with the transition? Problems include diacritic translation (ü -> 'ue', é -> e, etc.). Does the search explicitly know which language it's indexing/querying? Some languages don't have obvious word boundaries (unlike English, which separates words with spaces).
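Diacritic folding of the kind mentioned above might look like this sketch, which uses Python's unicodedata module. The German expansion table and the function name are assumptions; whether ü should become 'ue' or plain 'u' depends on the language, which is why the expansions are passed in rather than hard-coded.

```python
import unicodedata

# Hypothetical language-specific expansions, applied before generic folding.
GERMAN_EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def fold_diacritics(text, expansions=None):
    """Lower-case text and reduce accented letters to unaccented ones."""
    # Compose first so that each accented letter is a single character.
    text = unicodedata.normalize("NFC", text.lower())
    for src, dst in (expansions or {}).items():
        text = text.replace(src, dst)
    # Decompose, then drop the combining accent marks that remain.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applying the same folding at index time and at query time would let a search for "cafe" match "café". It does nothing for languages without clear word boundaries, which need a separate tokeniser.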

3. Metadata

Metadata is a hassle to obtain - one option is to make authors include a few metadata tags when they create new content. "Free" metadata includes timestamps, author name, document type, size. Could use server logs to generate document popularity ratings to create a "Popular" meta field (more below, in Ranking).

4. Ranking

Beyond basic statistical word ranking, we can have additional fields in the database that measure a document's popularity (as a chosen URI to visit after a search), and the words used to find it. This would create data that allows things like "20 other users visited this document after searching for 'moo cows'."
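A minimal sketch of the popularity idea follows, assuming clicks on search results can be recorded. The class, its method names and the logarithmic boost are illustrative choices, not a proposed formula. It counts visits rather than distinct users, so a true "20 other users" figure would also need user ids.

```python
import math
from collections import Counter, defaultdict


class PopularityLog:
    """Tracks which documents are visited after which searches."""

    def __init__(self):
        self.visits = Counter()              # doc_id -> visits after any search
        self.queries = defaultdict(Counter)  # doc_id -> query -> visits

    def record_click(self, query, doc_id):
        """Call when a user opens doc_id from the results for query."""
        self.visits[doc_id] += 1
        self.queries[doc_id][query.lower()] += 1

    def visits_for(self, doc_id, query):
        """How many times doc_id was opened after this exact query."""
        return self.queries[doc_id][query.lower()]

    def boost(self, doc_id, base_score):
        """Scale a word-based score by a damped popularity factor."""
        return base_score * (1 + math.log1p(self.visits[doc_id]))


log = PopularityLog()
log.record_click("moo cows", 5)
log.record_click("Moo Cows", 5)
log.record_click("grass", 5)
```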

@@ Line 1: / Line 1: @@
-(First draft + random ideas)
 == Implementing a global search solution in Moodle ==
+'''Search engine'''
 # [[#crawler|Crawler]]
 ## [[#content2|Content to be indexed]]
@@ Line 12: / Line 11: @@
 ## [[#thequery|The query]]
 ## [[#results|Results page]]
+'''Other topics'''
+# [[#api|API]]
+# [[#i18n|Internationalisation]]
+# [[#meta|Metadata]]
+# [[#rank|Ranking]]
+# [[#random|Extras]]
 == Details ==
 . '''<span id="crawler">Crawler</span>'''
-: The crawler will in all likelihood be an internal module (as opposed to an HTML based crawler, that spiders through the moodle site using HTTP), that can 'sense' content that has to be indexed. 'Sense' in this situation will probably involve some form of flagging/callback system that is used to notify the module of database fields to extract and index, and similarly for updates and deletions.
+: The crawler will be an internal process that has access to the same structured data as the rest of the system (i.e. the database fields). This will improve crawling efficiency, and thus running time. Using internal data will also ensure that most of the content to be indexed will be standardised and quirk-free (no unclosed tags, etc.). A flagging/callback system will notify the crawler about last-update times and new content, enabling up-to-date searches.
 : 1.1 '''<span id="content2">Content to be indexed</span>'''
-:: Content to be indexed will be fields stored in the database, for example, the wiki module will expose a structure like '''(table_name|field1,field2)''', the crawler can then see that '''field1''' and '''field2''' are text-data located in '''table_name'''.
+:: Content to be indexed will be fields stored in the database, for example, the wiki module will expose a structure like '''(table_name|field1,field2)''', the crawler can then see that '''field1''' and '''field2''' are text-data located in '''table_name'''. Text in the moodle system usually has an associated format field which specifies if it is plain-text, HTML, mf, and so on. '''format_text()''' prepares text for final display, and renders all text into correct HTML format - we could use the output of this function as the final content to be indexed ('''strip_tags()''' to remove the HTML). This has the added advantage of searching the same text the user sees if he/she browses the site manually - in more detail, '''format_text()''' applies all the appropriate text filters, and one of them is the censor module, which removes certain words/phrases from output text, thus whilst a word may appear in the database a user may never get to see it because it is on the censor-list. So, in this case it would be better to index the text without the offending word (to prevent it from showing in the search result listing).
+:: HTML tags will be ignored for now - they introduce a lot of other issues that are not important at the moment.
+:: '''Structure'''
 :: [content parser] -> [indexer] ==> (results into database)
-:: Seperating the indexer into two modules (parser and indexer) allows different parsers for different situations to be dropped in - e.g. an HTML parser for static-content pages (non-database). A standard for the text passed to the indexer should be created; perhaps have an additional step in the pipeline for removing stop-words, and stemming etc.
+:: Seperating the indexer into two modules (parser and indexer) allows different parsers for different situations to be dropped in - for e.g. an HTML parser for static-content pages (non-database). A standard for the text passed to the indexer should be created; perhaps have an additional step in the pipeline for removing stop-words, and stemming etc.
 : 1.2 '''<span id="indexing">Indexing</span>'''
-:: Possible structures :
+:: Efficiency/run time: this must be quick, investigate PHP optimisations. (Any chance of dropping down to a faster language if need be?). Like updates, is this eager- or lazy- indexing - i.e. index documents as they are submitted for the first time, or do batch indexing jobs at a certain time (e.g. cron <script>).
-:: [indexed word, doc_id, meta-keyword (bool), html_tag, position in text]
+:: '''Tables''':
+:: ('''documents'''|doc_id, title, parent_id, type, permissions)
+:: ('''words'''|id, word)
+:: ('''postings'''|doc_id, word_id)
+:: ('''popular'''|doc_id, [word_id's]) (?)
-:: '''html_tag''' for importance testing, i.e. &lt;b&gt;word&lt;/b&gt; beats &lt;em&gt;word&lt;/em&gt; in ranking.
+:: Don't worry about the above yet, just putting some basic structures down for now.
-:: '''position in text''' for testing proximity of words (for complete phrase ranking) - this depends on performance, start with basic indexing and work up from that.
 : 1.3 '''<span id="storage">Storage of results</span>'''
-:: Whichever DBMS used for the Moodle installation, will also be used for storing the inverted indices. There is a paper (insert link here) documenting optimisations and structures to put an RDBMS implementation together to rival a typical B-Tree/heap file system.
+:: A selling point of Moodle is it's cross-db compatibility - as such, all database calls go through AdoDB, and furthermore it follows that the search module should really try not to use any DBMS-specific features (e.g. MySQL full-text search). All of the db queries should be standard SQL. Personally, I don't think this is a problem as I'm imagining most of the queries will be basic insert/update/delete's and selects with joins. The schema(s) must be defined in XML.
+:: Ranking will be calculated in the module, using some formula and various bits of information selected from the database.
 : 1.4 '''<span id="updates">Updating content</span>'''
@@ Line 42: / Line 56: @@
 . '''<span id="query">Query module</span>'''
 : Similar to indexer, will have an input parser to take the user's string and turn it into a standardised query/structure.
+: Important: only search pages/texts that the user has permission to see! In database, there will be a main listing of all searchable documents, with their title, type, URI, parent ID stored, etc. To this table add some form of permission token, a text identifier matching the current method the permission system uses. Can then check if '''student''' can '''forum_canreadposts''', for example - and then check the student against this particular forum, and so on. The system must be flexible enough to allow future Moodle versions to continue with new permission schemes.
+: Have fields for userid, groupid, courseid for each text/document.
 : 2.1 '''<span id="input">Input module</span>'''
@@ Line 50: / Line 68: @@
 ::: '''<''' decrease word contribution compared to other words in query
 ::: '''>''' increase word contribution to overall rank
+:: This should include at least all of the options available here: [http://moodle.org/mod/forum/search.php?id=5].
 : 2.2 '''<span id="thequery">The query</span>'''
-:: ...
+:: The interface is important from user perspective. Apparently the average query length from an average user is 1.3 words (and then they press enter and wait for the results). Important to make the most basic case (1.3 words + submit) follow the best (read predictable) course of action and return the best results. Additional options for searching listed below search box, all optional. Have more than the usual lacking 10 results on the first page, 25 (customisable at least).
+:: Search in user's/the specific Moodle's language...? see [[#i18n|Internationalisation]].
 : 2.3 '''<span id="results">Results page</span>'''
+:: Show the first '''n''' words from a matching page (from the start), or try show the relevant matching text? - something only possible with phrase matching/proximity calculations, probably something to leave out for the time being. Another reason to make sure everything is modular, so features like this can be dropped in later!
+:: Ranking: [[#rank|Ranking]]
 :: Highlight search keywords, pagination, search within results?
+----
+. '''<span id="api">API</span>'''
+: It would be good to develop and possibly use a web-interface for this. Operate by passing XML via POST to the search module, which then performs whichever task was asked. The benefit of this is that it seperates the search module from the PHP language, allowing anything capable of sending a POST command to a web-server to interact with the software. I'm possibly trying to be to general here, since this is a specific application for Moodle, but the option could always be built-in, but it doesn't have to be used. So, pretty much a not-necessary feature for the basic project.
+. '''<span id="i18n">Internationalisation</span>'''
+: Moodle is going Unicode eventually, is the search going to cope with the transition? Problems include diacritic translation (umlaut -> 'ue', e(accent) -> e, etc.). Does the search explicitly know which language it's indexing/querying? Some languages don't have obvious word boundaries (like english with spaces).
+. '''<span id="meta">Metadata</span>'''
+: Metadata is a hassle to obtain - one option is to make authors include a few metadata tags when they create new content. "Free" metadata includes timestamps, author name, document type, size. Could use server logs to generate document popularity ratings to create a "Popular" meta field (more below, in Ranking).
+. '''<span id="rank">Ranking</span>'''
+: Beyond basic statistical word ranking, we can have additional fields in the database that measure a document's popularity (as a chosen URI to visit after a search), and the words used to find it. This would create data that allows things like "20 other users visited this document after searching for 'moo cows'."
 == See also ==
 *[[Student projects]]
-*[http://2006.planet-soc.com/?q=blog/163 Michael Champanis @ Planet-SOC]
---[[User:Michael Champanis|Michael Champanis]] 05:30, 30 May 2006 (WST)
+--[[User:Michael Champanis|Michael Champanis]] 10:30, 2 June 2006 (WST)
 [[Category:Developer]]
 [[Category:Project]]
