Global search brainstorming

Note: This page is a work-in-progress. Feedback and suggested improvements are welcome. Please join the discussion on moodle.org or use the page comments.

Moodle 2.3


Global Search is a proposed rewrite of the search mechanism for Moodle 2.3. Also see the discussion on the forum.

Requirements

  1. Security must be handled - search should not return any results that are not accessible to the current user.
  2. Security can be handled globally only up to some level - a module must have the final say on whether a document is accessible. Think about more complex modules (like workshop) that may allow access to a document only at a specific time (e.g. only when in the peer review phase). Access information can and should be stored in the index for performance reasons, but each module should be responsible for the decision to deny or grant access.
  3. External documents must be indexed and searchable. Think about all the docs uploaded into Moodle installations as resources.
  4. The search index must be updated in real time (say, with a few minutes' delay?).
  5. Notifications should be handled using Moodle events. There is no point in implementing a separate notification system for Search - if our existing events are not usable then let's fix them or drop them.
  6. The solution must scale up. Let's put some numbers here that we can use for testing - 1m documents, with an average of 100 words each (so 100m words in total, 1m unique)?
  7. Initial indexing must be able to stop and pick up where it left off, so when a big site enables search, initial indexing can be done in chunks over days (if needed).
  8. Search results must allow ordering by relevancy. This is a big topic in itself, but for example searching for "dog cats" should first list documents that contain both words and then those containing only one of them. Searching for "dog" should rank documents that contain the word "dog" in several places as more relevant than those that contain it only once.
  9. Advanced queries should be implemented to allow for:
    1. grouping of ANDs and ORs (e.g. word1 AND (word2 OR word3))
    2. returning results that do not contain a keyword
    3. stemming (searching for "car" will return results with "cars")
    4. wildcards (searching super* will find superman and superwoman)
    5. attribute search (e.g. find XX in the title)
    6. phrase search (find phrase "frog zombies")
  10. Other nice-to-have search features, not strictly required:
    1. proximity searching
    2. fuzzy searching
    3. handling case sensitivity
    4. regular expressions search
  11. There must be an "out of the box" search that will not require additional server software (it should be PHP only).
  12. The API for the module author should be as simple to implement as possible.

Implementation

The whole solution will consist of a few parts:

  • The core Search Module
  • Modules API
  • The Search Engine

Core Search Module

The core implementation will provide:

  • cron job for the incremental indexing of the content
  • UI pages to present indexing information
  • UI for search and search results
  • controls for clearing the index and full re-indexing
  • caching
  • processing files to extract text (pdf, doc, etc).

Indexing

Indexing will be triggered from cron and will run for a defined, configurable time (5 minutes by default). For each registered module, the indexer will call mod_get_search_iterator($from) and will remember the value of the last indexed timestamp for the next run. The iterator will return an $id, as defined/understood by the module. This $id points to a document set. A document set consists of zero or more documents that will be indexed.
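
A minimal sketch of that cron loop, assuming hypothetical helpers search_get_registered_modules() and search_engine_add_document(), illustrative 'search' config names, and a timestamp column called "modified" as in the forum example further down:

function search_cron_indexing() {
  // Run for a configurable time only; 5 minutes here for illustration.
  $stoptime = time() + 5 * MINSECS;

  foreach (search_get_registered_modules() as $mod) {
    // Timestamp of the last indexed record for this module (0 = index everything).
    $from = (int)get_config('search', 'lastrun_' . $mod);

    $rs = call_user_func($mod . '_get_search_iterator', $from);
    foreach ($rs as $record) {
      // One document set id can expand into several documents (content + attachments).
      foreach (call_user_func($mod . '_get_search_documents', $record->id) as $document) {
        search_engine_add_document($document);
      }
      set_config('lastrun_' . $mod, $record->modified, 'search');
      if (time() >= $stoptime) {
        break;   // out of time, continue on the next cron run
      }
    }
    $rs->close();

    if (time() >= $stoptime) {
      break;
    }
  }
}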

Modules API

Each module will need to implement a set of API functions to make itself searchable.

Basic support

The basic support will make the module's exposed data searchable, but will not allow for (near) real-time updates to the index. A module will need to do the following (a minimal sketch of the declaration follows the list):

  • declare that it supports FEATURE_GLOBAL_SEARCH (in function <mod>_supports).
  • implement 3 functions:
    • mod_get_search_iterator($from = 0)
    • mod_get_search_documents($id)
    • mod_search_access($id)
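
For forum, the FEATURE_GLOBAL_SEARCH declaration would be a one-line addition to the existing forum_supports() function; a minimal sketch is below (the existing cases are left out, and the three functions themselves are described in the Search API description section):

function forum_supports($feature) {
  switch ($feature) {
    case FEATURE_GLOBAL_SEARCH:
      return true;
    // ... existing FEATURE_* cases stay as they are ...
    default:
      return null;
  }
}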

Both content and (file) documents can be returned by modules as search documents. This is the very minimum that needs to be exposed, but the following limitations will apply:

  • There is no way to notify the search code that a new document has been added; it will only be picked up when the indexer runs again on the next cron run.
  • When content is updated or created but the record's timestamp doesn't change, it will not be indexed.

Advanced support

To allow for near real-time updates to the data, a module needs to implement the following events:

  • when data is updated

And functions:

  • mod_search_events() - to describe the events to the search engine
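
The exact shape of mod_search_events() is still open; a possible sketch for forum is below, where both the event names and the return format (a plain list of events the search engine should observe, whose event data identifies the affected document set) are assumptions only:

function forum_search_events() {
  // Illustrative only - real event names and payloads are not defined yet.
  return array('forum_post_created', 'forum_post_updated', 'forum_post_deleted');
}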

Search Engine

The search engine will be responsible for the actual indexing, processing, storing and searching. The initial implementation will use the Lucene search engine as implemented by Zend (a pure PHP library, no Java required). The search engine will be de-coupled, so it should be possible to replace it with another search engine (e.g. Solr, Sphinx, Xapian). A small illustration follows the list below.


  • Will not require additional DB tables to store the index.
  • Will allow for partial indexing of the content (e.g. the first time the indexer is run, it may only index part of a huge site and pick up where it left off on the next run).
  • Will allow for indexing content from the DB and attachments (e.g. forum post & all attachments).
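
For illustration only, a minimal sketch of indexing and searching with Zend_Search_Lucene (the Zend Framework 1 component referred to above); the index path, field names and query are assumptions, not a defined schema:

require_once 'Zend/Search/Lucene.php';

// Create a file-based index under moodledata (use Zend_Search_Lucene::open() for an existing one);
// no extra DB tables are needed.
$index = Zend_Search_Lucene::create($CFG->dataroot . '/search');

// Index one document.
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Keyword('module', 'forum'));
$doc->addField(Zend_Search_Lucene_Field::Keyword('docsetid', 42));
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Frog zombies'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('content', 'Full text of the post and its attachments ...'));
$index->addDocument($doc);
$index->commit();

// Search with relevancy ranking and advanced query syntax (grouping, field search).
$hits = $index->find('title:frog AND (zombies OR cars)');
foreach ($hits as $hit) {
  echo $hit->score . ' ' . $hit->title . "\n";
}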



Search API description

mod_get_search_iterator($from=0)

The function has to return a Moodle recordset with two columns:

  • document set id for this particular module
  • timestamp of the last modification

$from is used to return only documents that were modified after the given timestamp. If $from equals 0, then all documents should be returned in the recordset. The meaning of a document set will differ for each module. For example, the forum module will return a post id as a single document set id. Inside such a document set there will be several documents, in this case the actual forum post and its attachments. Security (access rights) will be managed at the document set level. In this example, you either have or don't have access to both the forum post and its attachments (fixing some bugs like MDL-29660 will be required here). Modules that want to support the Global Search interface have to keep track of the last modification time of their documents, and the timemodified column must be indexed.

The code for this function will usually be very simple; here is a complete example for forum:

function forum_get_search_iterator($from = 0) {
  global $DB;

  // Return (post id, last modification time) pairs for all posts modified since $from.
  $sql = "SELECT id, modified FROM {forum_posts} WHERE modified >= ? ORDER BY modified ASC";

  return $DB->get_recordset_sql($sql, array($from));
}

mod_get_search_documents($id)

The method returns an array of documents for the given ID. For forum, it would be one document for the forum post and zero or more for post attachments. Each document is a simple object of the gs_document class. The following attributes can be set:

  • $type Document type, can be set to GS_TYPE_TEXT (plain text in $content), GS_TYPE_HTML (HTML in $content), GS_TYPE_FILE (external file)
  • $contextlink Link to a context of the document (e.g. forum post)
  • $directlink Direct link to the document (e.g. a document attached to a post). By default $directlink and $contextlink are the same (you can set either of them)
  • $title Title of the document
  • $content The main content. For external files, the field is ignored.
  • $user Author's user object.
  • $module Module that has created original document.
  • $id Id that identifies a set of documents for a given module.
  • $created Timestamp when the document was created.
  • $modified Timestamp of the last modification.
  • $courseid The course ID where the document is located.
  • $filepath Path to the external file with the document. Set when type is GS_TYPE_FILE.
  • $mime Mime type of the external file. Set when type is GS_TYPE_FILE.

You can see a working implementation in forum_gs_get_documents; a simplified sketch follows.
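
A simplified sketch of such an implementation for forum (not the actual forum_gs_get_documents code; how gs_document instances are constructed and how attachments are handled are assumptions):

function forum_get_search_documents($id) {
  global $DB;

  // $id is the document set id returned by the iterator, i.e. a forum post id.
  $post = $DB->get_record('forum_posts', array('id' => $id), '*', MUST_EXIST);
  $discussion = $DB->get_record('forum_discussions', array('id' => $post->discussion), '*', MUST_EXIST);

  $doc = new gs_document();
  $doc->type = GS_TYPE_HTML;
  $doc->title = $post->subject;
  $doc->content = $post->message;
  $doc->user = $DB->get_record('user', array('id' => $post->userid));
  $doc->module = 'forum';
  $doc->id = $id;
  $doc->created = $post->created;
  $doc->modified = $post->modified;
  $doc->courseid = $discussion->course;
  $contexturl = new moodle_url('/mod/forum/discuss.php', array('d' => $post->discussion), 'p' . $post->id);
  $doc->contextlink = $contexturl->out(false);

  $documents = array($doc);

  // Each attachment would be added as an extra GS_TYPE_FILE document
  // with $filepath and $mime set; omitted here for brevity.

  return $documents;
}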

mod_search_access($id)

Function to check if the current $USER has access to the document set $id (a sketch for forum follows the list below). The return value should be one of:

  • GS_ACCESS_DENIED - the current user cannot access document set $id
  • GS_ACCESS_GRANTED - access granted
  • GS_ACCESS_DELETED - the document set with this $id does not exist
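
A simplified sketch of what this could look like for forum; the single capability check shown is illustrative, and a real implementation would also need to handle group mode, hidden activities, Q&A forum rules, etc.:

function forum_search_access($id) {
  global $DB;

  $post = $DB->get_record('forum_posts', array('id' => $id));
  if (!$post) {
    return GS_ACCESS_DELETED;
  }

  $discussion = $DB->get_record('forum_discussions', array('id' => $post->discussion), '*', MUST_EXIST);
  $cm = get_coursemodule_from_instance('forum', $discussion->forum, $discussion->course, false, MUST_EXIST);
  $context = context_module::instance($cm->id);

  // Capability check only - see the caveats above.
  if (!has_capability('mod/forum:viewdiscussion', $context)) {
    return GS_ACCESS_DENIED;
  }

  return GS_ACCESS_GRANTED;
}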

External files indexing

The following file formats will be supported:

  • doc
  • docx
  • ppt
  • pptx
  • pdf
  • xml
  • html
  • txt
  • odt
  • odp

Edge cases

Backup restore

When a backup is restored, it may create a course that contains data (document sets) - e.g. users' posts. When restored, timemodified may be in the past, which means that the time-based iterator will not "pick up" those entries. One solution is to implement logic in the core search to watch for "missing" IDs and query for them in the hope of retrieving valid documents. This will work only if modules use numeric, incrementing IDs.

Performance

A few identified performance killers or bottlenecks.

Access check

Problem

Each document in the search results needs to be checked to determine whether it is accessible to the currently logged-in user. As only a module will know for sure whether a user has access to a result, this may generate an enormous number of calls to the mod_search_access($id) function, which may be very expensive. Notice that the maximum possible number of calls to mod_search_access($id) per search is not the number of results per page but the number of all matched documents. This is the worst-case scenario, when the user has access to fewer than results_per_page of all the matched results.

Possible solutions

It may be possible to quickly filter out at least some of the search results based on the data in the search index. For example, if the user is not a participant in $courseid, the search engine can assume that they will not have access to any of the results from that course. In this case a list of course IDs would need to be passed from the search code into the search engine, as sketched below.
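
A minimal sketch of how the search code could gather that list and hand it to the engine (search_engine_query() and its filter format are illustrative assumptions):

// The user's query, plus the ids of all courses the current user is enrolled in.
$query = 'dog AND cats';
$courseids = array_keys(enrol_get_my_courses('id'));

// Pass the course ids to the engine as a pre-filter, so documents from other
// courses are never scored or access-checked at all.
$results = search_engine_query($query, array('courseids' => $courseids));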

The Search API could be extended to allow modules to declare "simplified access". This would be a way for a module to say: "look, search engine, the access rules for my content are very simple, you can make the access decision by yourself".

Score calculation

Problem

Usually the most costly operation for a search engine is calculating the relevancy (score) of each search result. If a search matches a lot of results, then the time spent calculating relevancy for each document may be substantial.

Possible solutions

Ideally, the search engine would work like this:

  • find matching documents
  • filter out documents that are not accessible for the current user
  • calculate relevancy

Basically, the documents that the user cannot access anyway should be filtered out as soon as possible.

Testing

A list of test cases.

New content

  • index
  • add new forum post
  • re-index
  • make sure new content is searchable

Updated content

  • index
  • edit existing forum post
  • re-index
  • make sure new content is searchable
  • make sure old content is not searchable

Deleted content

  • index
  • delete existing forum post
  • re-index
  • make sure deleted content is not searchable

Backup restored

  • index
  • restore whole course from a backup
  • re-index
  • make sure restored content is searchable

Ideas

  • Maybe you want modules to have a say in how an item shows up in the search results? Also, you may want to think about how global search fits in with the often-requested anonymity features (e.g. anonymous forum posts) -- e.g. if a non-privileged user searches for a post's author, their "anonymous" forum posts should not show up.
  • Not only modules could be searchable. Extra search modules could be implemented as plugins, in a similar way to other Moodle plugins. They could allow searches on non-module-related data, e.g. courses and users.