Global search brainstorming

From MoodleDocs
Revision as of 13:59, 31 October 2011

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

Moodle 2.3


Global Search is a proposed rewrite of the search mechanism for Moodle 2.3.

Requirements

  1. Security must be handled - search must not return any results that the current user cannot access.
  2. Security can be handled globally only up to a point - a module must have the final say on whether a document is accessible. Consider more complex modules (like workshop) that allow access to a document only at a specific time (e.g. only during the peer review phase). Access information can and should be stored in the index for performance reasons, but each module should be responsible for the decision to grant or deny access.
  3. External documents must be indexed and searchable. Think of all the documents uploaded into Moodle installations as resources.
  4. The search index must be updated in near real time (say, with a delay of a few minutes?).
  5. Notifications should be handled using Moodle events. There is no point in implementing a separate notification system for Search - if our existing events are not usable, then let's fix them or drop them.
  6. The solution must scale up. Let's put some numbers here that we can use for testing - 1m documents with an average of 100 words each (so 100m words in total, 1m unique)?
  7. Initial indexing must be able to stop and pick up where it left off, so that when a big site enables search, initial indexing can be done in chunks over several days (if needed).
  8. Search results must allow ordering by relevance. This is a big topic in itself, but for example a search for "dog cats" should list documents that contain both words first, followed by those that contain only one of them. A search for "dog" should rank documents that contain the word "dog" in several places as more relevant than those that contain it only once.
  9. Advanced queries should be implemented to allow for:
    1. grouping of ANDs and ORs (e.g. word1 AND (word2 OR word3))
    2. returning results that do not contain a keyword
    3. stemming (searching for "car" will return results with "cars")
    4. wildcards (searching for super* will find superman and superwoman)
    5. attribute search (find XX in title)
    6. phrase search (find the phrase "frog zombies")
  10. Other nice-to-have search features that are not strictly required:
    1. proximity searching
    2. fuzzy searching
    3. handling case sensitivity
    4. regular expression search
  11. The solution must not require additional server software (should be PHP only).
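
The relevance ordering described in requirement 8 can be illustrated with a minimal sketch. This is not how Lucene scores documents; it is a simplified stand-in that ranks first by the number of distinct query words found and then by the total number of occurrences. The function names gs_sketch_score and gs_sketch_rank are hypothetical and do not appear in the proposal.

```php
<?php
// Hypothetical helper (not part of the proposal): scores one document.
// Returns [number of distinct query words found, total occurrences of those words].
function gs_sketch_score(string $text, array $querywords): array {
    $counts = array_count_values(str_word_count(strtolower($text), 1));
    $distinct = 0;
    $total = 0;
    foreach ($querywords as $word) {
        $hits = $counts[strtolower($word)] ?? 0;
        if ($hits > 0) {
            $distinct++;
            $total += $hits;
        }
    }
    return [$distinct, $total];
}

// Hypothetical helper: orders document ids by relevance.
// $documents maps a document id to its plain text.
// Documents that match no query word are dropped.
function gs_sketch_rank(array $documents, array $querywords): array {
    $scores = [];
    foreach ($documents as $id => $text) {
        $score = gs_sketch_score($text, $querywords);
        if ($score[0] > 0) {
            $scores[$id] = $score;
        }
    }
    // More distinct matches first, then more occurrences.
    uasort($scores, function (array $a, array $b): int {
        return ($b[0] <=> $a[0]) ?: ($b[1] <=> $a[1]);
    });
    return array_keys($scores);
}
```

With the query "dog cats", a document containing both words is listed before a document that repeats "dog" several times, which in turn is listed before a document that contains "dog" once.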

Implementation

  • Will use the Lucene search engine as implemented by Zend (a PHP library, no Java required).
  • Will not require additional DB tables to store the index.
  • Will allow partial indexing of the content (e.g. the first time the indexer is run, it may index only part of a huge site and pick up where it left off on the next run).
  • Will allow indexing content from the DB together with attachments (e.g. a forum post and all its attachments).
  • Will implement at least a basic cache.
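
Partial indexing could be organised roughly as in the sketch below, which processes a limited number of document sets per run and returns the timestamp to resume from. The function name gs_sketch_index_chunk, the $limit parameter and the $indexone callback are assumptions made for illustration; $rows stands in for a recordset ordered by modification time, with an id and a modified field on each row.

```php
<?php
// Hypothetical sketch (not part of the proposal): indexes roughly $limit
// document sets modified after $from and returns the timestamp to pass as
// $from on the next run. Rows must be ordered by modified, ascending.
function gs_sketch_index_chunk(iterable $rows, int $from, int $limit, callable $indexone): int {
    $done = 0;
    $resume = $from;
    foreach ($rows as $row) {
        if ($row->modified <= $from) {
            continue; // Already covered by an earlier run.
        }
        // Stop once the limit is reached, but only at a timestamp boundary:
        // rows sharing the last indexed timestamp are finished first, because
        // the next run asks for "modified > $resume" and would skip them.
        if ($done >= $limit && $row->modified > $resume) {
            break;
        }
        $indexone($row->id);
        $resume = $row->modified;
        $done++;
    }
    return $resume;
}
```

The returned timestamp would need to be stored between runs (for example in a configuration setting) so that the next run can continue from it.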

Modules support

A module that wants to be searchable by Global Search must implement a small interface. The module must:

  • declare that it supports FEATURE_GLOBAL_SEARCH (in the function <mod>_supports).
  • implement three functions:
    • <mod>_gs_iterator($from = 0)
    • <mod>_gs_get_documents($id)
    • <mod>_page_gs_access($id)
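
As a rough illustration, declaring support might look like the sketch below. FEATURE_GLOBAL_SEARCH is the constant proposed on this page; the value given to it here is an assumption, and gs_sketch_module_is_searchable is a hypothetical helper showing how the search code could discover searchable modules.

```php
<?php
// The constant is proposed on this page; its value here is an assumption.
if (!defined('FEATURE_GLOBAL_SEARCH')) {
    define('FEATURE_GLOBAL_SEARCH', 'globalsearch');
}

// Follows the usual <mod>_supports() pattern: true when the feature is
// supported, null when the module does not know the feature.
function forum_supports($feature) {
    switch ($feature) {
        case FEATURE_GLOBAL_SEARCH:
            return true;
        default:
            return null;
    }
}

// Hypothetical helper: does module $mod declare Global Search support?
function gs_sketch_module_is_searchable(string $mod): bool {
    $supports = $mod . '_supports';
    return function_exists($supports) && $supports(FEATURE_GLOBAL_SEARCH) === true;
}
```

A real forum_supports function already handles many other features; only the Global Search case is shown here.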

mod_gs_iterator($from=0)

The function must return a Moodle recordset with two columns:

  • document set id for this particular module
  • timestamp of the last modification

$from is used to return only documents that were modified after the given timestamp. If $from equals 0, all documents should be returned in the recordset.

The meaning of a document set differs for each module. For example, the forum module will return the post id as a single document set id. Such a document set contains several documents; in this case, the actual forum post and any attachments.

Security (access rights) will be managed at the document set level. In this example, you either have or do not have access to both the forum post and its attachments (fixing some bugs like MDL-29660 will be required here).

Modules that want to support the Global Search interface have to keep track of the last modification time of their documents. The column holding that timestamp (e.g. timemodified) must be indexed in the database.

The code for this function will usually be very simple. Here is a complete example for forum:

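/**
 * Returns the forum document sets (posts) modified after $from.
 *
 * @param int $from Timestamp; 0 returns all posts.
 * @return moodle_recordset Rows with the columns id and modified.
 */
function forum_gs_iterator($from = 0) {
  global $DB;

  $sql = "SELECT id, modified FROM {forum_posts} WHERE modified > ? ORDER BY modified ASC";

  return $DB->get_recordset_sql($sql, array($from));
}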

mod_gs_get_documents($id)

The function returns an array of documents for the given document set ID. For forum, this would be one document for the forum post and zero or more documents for the post's attachments. Each document is a simple object of the gs_document class. The following attributes can be set:

  • $type Document type, can be set to GS_TYPE_TEXT (plain text in $content), GS_TYPE_HTML (HTML in $content), GS_TYPE_FILE (external file)
  • $contextlink Link to a context of the document (e.g. forum post)
  • $directlink Direct link to the document (e.g. a document attached to a post). By default, $directlink and $contextlink are the same (you can set either of them)
  • $title Title of the document
  • $content The main content. For external files, the field is ignored.
  • $user Author's user object.
  • $module Module that has created original document.
  • $id Id that identifies a set of documents for a given module.
  • $created Timestamp when the document was created.
  • $modified Timestamp of the last modification.
  • $courseid The course ID where the document is located.
  • $filepath Path to the external file with the document. Set when type is GS_TYPE_FILE.
  • $mime Mime type of the external file. Set when type is GS_TYPE_FILE.

A working implementation of forum_gs_get_documents is available.
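
The attribute list above can be made concrete with a minimal sketch. The class body, the constant values and the link formats below are assumptions made for illustration only. The function gs_sketch_forum_documents is a hypothetical stand-in for forum_gs_get_documents that takes post data as plain arrays instead of reading from the database, and it leaves $user and $courseid unset for brevity.

```php
<?php
// Constant values are assumptions; the page only names the constants.
const GS_TYPE_TEXT = 0;
const GS_TYPE_HTML = 1;
const GS_TYPE_FILE = 2;

// Assumed shape of gs_document, based on the attribute list above.
class gs_document {
    public $type = GS_TYPE_TEXT;
    public $contextlink;
    public $directlink;
    public $title;
    public $content;
    public $user;
    public $module;
    public $id;
    public $created;
    public $modified;
    public $courseid;
    public $filepath;
    public $mime;
}

// Hypothetical stand-in for forum_gs_get_documents($id): one document for
// the post itself and one per attachment, all sharing the document set id.
function gs_sketch_forum_documents(array $post, array $attachments): array {
    $main = new gs_document();
    $main->type = GS_TYPE_HTML;
    $main->module = 'forum';
    $main->id = $post['id'];
    $main->title = $post['subject'];
    $main->content = $post['message'];
    $main->created = $post['created'];
    $main->modified = $post['modified'];
    // Illustrative link format only.
    $main->contextlink = '/mod/forum/discuss.php?d=' . $post['discussion'] . '#p' . $post['id'];
    $main->directlink = $main->contextlink;

    $docs = [$main];
    foreach ($attachments as $file) {
        $doc = clone $main;            // Same set id, module and context link.
        $doc->type = GS_TYPE_FILE;
        $doc->title = $file['name'];
        $doc->content = null;          // Ignored for external files.
        $doc->filepath = $file['path'];
        $doc->mime = $file['mime'];
        $doc->directlink = $file['url'];
        $docs[] = $doc;
    }
    return $docs;
}
```

Because every document in the set carries the same $id, a single access check per document set is enough to decide whether the post and its attachments may be shown.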

mod_page_gs_access($id)

Checks whether the current $USER has access to the document set $id. The return value must be one of:

  • GS_ACCESS_DENIED - the current user cannot access document set $id
  • GS_ACCESS_GRANTED - access granted
  • GS_ACCESS_DELETED - the document set with this $id does not exist
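
The three return values suggest how the search code could post-process raw index hits. The sketch below is one possible reading, not the proposed implementation: granted hits are shown, denied hits are hidden, and deleted hits are collected so they can be removed from the index. The constant values, the function name gs_sketch_filter_hits and the shape of a hit are assumptions; $access stands in for the module's access callback.

```php
<?php
// Constant values are assumptions; the page only names the constants.
const GS_ACCESS_DENIED = 0;
const GS_ACCESS_GRANTED = 1;
const GS_ACCESS_DELETED = 2;

// Hypothetical helper: splits raw index hits into those the current user may
// see and those whose document set no longer exists.
// $access is called with the document set id and returns a GS_ACCESS_* value.
function gs_sketch_filter_hits(array $hits, callable $access): array {
    $visible = [];
    $stale = [];
    foreach ($hits as $hit) {
        $result = $access($hit['id']);
        if ($result === GS_ACCESS_GRANTED) {
            $visible[] = $hit;
        } else if ($result === GS_ACCESS_DELETED) {
            $stale[] = $hit; // Candidate for removal from the index.
        }
        // GS_ACCESS_DENIED and anything unexpected: hide the hit.
    }
    return ['visible' => $visible, 'stale' => $stale];
}
```

Treating any unexpected return value as denied keeps the filter on the safe side of the security requirement.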

External files indexing

The following file formats will be supported:

  • doc
  • docx
  • ppt
  • pptx
  • pdf
  • xml
  • html
  • txt
  • odt
  • odp

Testing

A list of test cases.

New content

  • index
  • add new forum post
  • re-index
  • make sure new content is searchable

Updated content

  • index
  • edit existing forum post
  • re-index
  • make sure new content is searchable
  • make sure old content is not searchable
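
The "Updated content" case can be expressed as an automated check. The sketch below uses a hypothetical in-memory index (gs_sketch_index) purely to show the shape of such a test; a real test would exercise the actual indexer and search code.

```php
<?php
// Hypothetical in-memory index, used only to illustrate the test case.
class gs_sketch_index {
    private $docs = [];

    // Re-indexing replaces any text previously stored for the same id.
    public function index(array $posts): void {
        foreach ($posts as $id => $text) {
            $this->docs[$id] = strtolower($text);
        }
    }

    // Returns the ids of all documents containing $word as a whole word.
    public function search(string $word): array {
        $found = [];
        foreach ($this->docs as $id => $text) {
            if (in_array(strtolower($word), str_word_count($text, 1), true)) {
                $found[] = $id;
            }
        }
        return $found;
    }
}

// Index, edit the post, re-index, then check new and old content.
$index = new gs_sketch_index();
$index->index([7 => 'Old wording']);
$index->index([7 => 'New wording']);
$newfound = $index->search('new');
$oldfound = $index->search('old');
```

After the second indexing run, $newfound should contain post 7 and $oldfound should be empty.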

Deleted content

  • index
  • delete existing forum post
  • re-index
  • make sure deleted content is not searchable

Backup restored

  • index
  • restore whole course from a backup
  • re-index
  • make sure restored content is searchable