Global search brainstorming

Warning: This page is no longer in use. The information contained on the page should NOT be seen as relevant or reliable.

Global Search
Project state	Deferred, though please see Global search (GSoC2013)
Tracker issue	MDL-31989
Discussion	Global Search rewrite
Assignee

Global Search is a proposed rewrite of the search mechanism for Moodle 2.x. This has been accepted as a Google Summer of Code 2013 project - Global search (GSoC2013).

Update after Developer Hackfest on 2013.03.21

Breaking news.

There is no one working on the Global Search for Moodle 2.5/2.6 at the moment
It will most likely become a Google Summer of Code project
Lucene (especially PHP implementation) is a dead-end, we should not try to use it
There are no search engines written in PHP that we could re-use/embed
We should not try to re-create search engine functionality
Instead we should focus on creating a Moodle API for supporting search engines like solr.

The above makes code on https://github.com/tmuras/moodle/tree/gs obsolete, e.g.

search/search_lucene_*.php don't need to be used (https://github.com/tmuras/moodle/blob/gs/search/document/search_document.php may still be OK for the new implementation)
the code should not be re-integrated but used only as a reference, new code is required (against latest Moodle dev)

Not using Lucene means not using the embedded PHP version that I tried to use in https://github.com/tmuras/moodle/tree/gs. Do not confuse it with the Java Lucene library that solr uses under the hood; the fact that solr uses Lucene is (to simplify) irrelevant for us here. For us, solr will be a black-box search engine (not technically true, read on).

In reality, a high-performance solution will require a module/plugin for solr. On the solr side, some Moodle-specific logic should be executed to help with processing Moodle permissions. Ideally solr would do some initial filtering of the results depending on what kind of user is performing the query. For example, if the user is a student, all results related to hidden courses could be filtered out early in the search. In phase 1, a dumb solution could be implemented, where solr returns all results and they are filtered on the Moodle side.

Requirements and assumptions

Security must be handled - search should not return any results that are not accessible for the current user.
Security can be handled globally only up to some level - a module must have final say if a document is accessible. Think about more complex modules (like workshop) that can allow access to a document only at specific time (e.g. only when in peer review phase). Access information can and should be stored in the index for performance reasons but each module should be responsible for the decision of denying or granting access.
External documents must be indexed and search-able. Think about all the docs uploaded into Moodle installations as resources.
Search index can be updated asynchronously (e.g. the index may lag the live data by a few minutes)
Notifications should be handled using Moodle events.
Solution must scale up. Let's put some numbers here that we can use for testing - 1m documents, with average 100 words each (so 100m words in total, 1m unique)
Initial indexing must be able to stop and pick up where it left off, so when a big site enables search, initial indexing can be done in chunks over days (if needed).
Search results must allow ordering by relevancy. This is a big topic in itself, but for example searching for "dog cats" should list documents that contain both words first and then those that contain only one of them. Searching for "dog" should mark documents that contain the word "dog" in several places as more relevant than those that contain it only once.
Advanced queries should be implemented to allow for:
1. grouping of ANDs and ORs (e.g. word1 AND (word2 OR word3))
2. returning results that do not contain a keyword
3. stemming (searching for "car" will return results with "cars")
4. wildcards (searching super* will find superman and superwoman)
5. attributes search (find XX in title)
6. phrase search (find phrase "frog zombies")
Other nice to have search features but not really required:
1. proximity searching
2. fuzzy searching
3. handling case sensitivity
4. regular expressions search
The API for the module author should be as simple to implement as possible.
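
The relevancy requirement above (documents with more of the query words first, then documents with more occurrences) can be illustrated with a deliberately naive sketch. This is not how solr or any real engine scores documents, and `gs_naive_score` and `gs_naive_rank` are hypothetical names invented for this illustration. No stemming is applied, so "cat" would not match "cats" here.

```php
<?php
// Hypothetical helpers, not part of any Moodle or solr API.
// Score = [number of distinct query terms found, total occurrences].
function gs_naive_score(string $query, string $text): array {
    $terms = array_unique(preg_split('/\s+/', strtolower(trim($query)), -1, PREG_SPLIT_NO_EMPTY));
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    $matched = 0;
    $occurrences = 0;
    foreach ($terms as $term) {
        if (!empty($counts[$term])) {
            $matched++;
            $occurrences += $counts[$term];
        }
    }
    return [$matched, $occurrences];
}

// Returns document ids ordered from most to least relevant.
// $docs maps a document id to its plain text.
function gs_naive_rank(string $query, array $docs): array {
    $scores = [];
    foreach ($docs as $id => $text) {
        $scores[$id] = gs_naive_score($query, $text);
    }
    // Arrays of equal length compare element by element, so the number of
    // matched terms dominates and the occurrence count breaks ties.
    uksort($docs, function ($a, $b) use ($scores) {
        return $scores[$b] <=> $scores[$a];
    });
    return array_keys($docs);
}
```

A real engine would weight terms by rarity and document length as well; the sketch only shows the ordering rule described in the requirement.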

Implementation

The whole solution will consist of a few parts:

The core Search Module
Modules API
The Search Engine

Core Search Module

The core implementation that will provide:

cron job for the incremental indexing of the content
UI pages to present indexing information
UI for search and search results
controls for clearing the index and full re-indexing
caching?

Indexing

Indexing will be triggered from cron and will run for a defined, configurable time (5 minutes by default). For each registered module, the indexer will run mod_get_search_iterator($from) and will remember the timestamp of the last indexed item for the next run. The iterator will return $id, as defined/understood by the module. This $id points to a document set, which consists of zero or more documents that will be indexed.
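
A minimal sketch of that loop, under stated assumptions: `gs_run_indexer` is a hypothetical name, each module's iterator is passed in as a plain callable returning `[id, timestamp]` rows (standing in for a Moodle recordset), and the clock is injectable only so the time budget can be exercised in a test.

```php
<?php
// Hypothetical sketch of the cron-driven indexer; none of these names exist
// in Moodle. $iterators maps a module name to a callable standing in for
// mod_get_search_iterator($from), which returns [id, timestamp] rows ordered
// by timestamp ascending. $lastindexed maps a module name to the timestamp
// reached on the previous run.
function gs_run_indexer(array $iterators, array $lastindexed, callable $indexset,
                        int $timelimit = 300, ?callable $clock = null): array {
    $clock = $clock ?? 'time';
    $deadline = $clock() + $timelimit;
    foreach ($iterators as $module => $iterator) {
        $from = $lastindexed[$module] ?? 0;
        foreach ($iterator($from) as [$id, $modified]) {
            if ($clock() >= $deadline) {
                // Out of time: the saved timestamps let the next run resume.
                return $lastindexed;
            }
            $indexset($module, $id);
            $lastindexed[$module] = $modified;
        }
    }
    return $lastindexed;
}
```

The returned array would be persisted between cron runs, so a large site can be indexed in chunks.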

Modules API

Each module will need to implement a set of API functions to make itself search-able.

Basic support

The basic support will make the module's exposed data search-able but will not allow for real-time searches. A module will need to:

declare that it supports FEATURE_GLOBAL_SEARCH (in function <mod>_supports).
implement 3 functions:
- mod_get_search_iterator($from = 0)
- mod_get_search_documents($id)
- mod_search_access($id)

Both content and documents can be returned by modules as search documents. This is the very minimum that needs to be exposed but the following limitations will apply:

There is no way to notify the search code that a new document has been added. The indexer will pick it up on the next cron run.
When content is updated or created but the record's timestamp doesn't change, it will not be indexed.
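
Declaring support, the first step of basic support, might look like the sketch below for the forum module. The value given to FEATURE_GLOBAL_SEARCH is an assumption made up for this example; only the constant name comes from the proposal.

```php
<?php
// Assumption: the constant's value is a placeholder chosen for this sketch.
if (!defined('FEATURE_GLOBAL_SEARCH')) {
    define('FEATURE_GLOBAL_SEARCH', 'globalsearch');
}

// Follows the <mod>_supports() convention: true or false for features the
// module knows about, null for anything it does not recognise.
function forum_supports($feature) {
    switch ($feature) {
        case FEATURE_GLOBAL_SEARCH:
            return true;
        default:
            return null;
    }
}
```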

Advanced support

To allow for near real-time updates to the data, a module needs to implement the following events:

when data is updated

And functions:

mod_search_events() - to describe the events to the search engine

Search Engine

The search engine will be responsible for the actual indexing, processing, storing and searching. There is no decent implementation of a search engine in PHP (the PHP port of Lucene, Zend Lucene, is not good enough), so there will be no default search engine embedded in Moodle. Instead, the search engine will be de-coupled, so it should be possible to plug in engines like solr or sphinx.

should allow for partial indexing of the content (e.g. the first time the indexer is run, it may only index part of a huge site, and pick up where it left off on the next run).
should allow for indexing content from the DB and attachments (e.g. forum post & all attachments).
should process files to extract text (pdf, doc, etc).
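
One way to picture the de-coupling is a small engine contract that core search codes against, with solr or sphinx behind it. The interface and method names below are purely illustrative assumptions, and the in-memory class is a toy for tests, not a proposed default engine.

```php
<?php
// Hypothetical engine contract; the names are illustrative only.
interface gs_engine {
    public function add_document(string $module, $setid, array $fields): void;
    public function delete_set(string $module, $setid): void;
    // Returns a list of [module, setid] pairs.
    public function search(string $query): array;
}

// Toy engine doing a case-insensitive substring match on the 'content' field.
class gs_memory_engine implements gs_engine {
    private $docs = [];

    public function add_document(string $module, $setid, array $fields): void {
        $this->docs["$module:$setid"][] = $fields;
    }

    public function delete_set(string $module, $setid): void {
        unset($this->docs["$module:$setid"]);
    }

    public function search(string $query): array {
        $hits = [];
        foreach ($this->docs as $key => $documents) {
            foreach ($documents as $fields) {
                if (stripos($fields['content'] ?? '', $query) !== false) {
                    $hits[] = explode(':', $key, 2);
                    break; // One hit per document set is enough.
                }
            }
        }
        return $hits;
    }
}
```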

Search API description

Initial implementation is available on github. It needs to be changed by removing Lucene and finishing the work on Moodle Search API only.

mod_get_search_iterator($from=0)

The function has to return a Moodle recordset with two columns:

document set id for this particular module
timestamp of the last modification

$from is used to return only documents that were modified after the given timestamp. If $from equals 0, all documents should be returned in the recordset.

The meaning of the document set will differ for each module. For example, the forum module will return a post id as a single document set id. Inside such a document set there can be several documents, in our case the actual forum post and several attachments.

Security (access rights) will be managed at the document set level. In this example, you either have or don't have access to both the forum post and its attachments (fixing some bugs like MDL-29660 will be required here).

Modules that want to support the Global Search interface have to keep track of the last modification time of their documents. The timemodified column must be indexed.

The code for this function will usually be very simple, here is a complete example for forum:

<pre>
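// Returns (document set id, last modification time) for every forum post
// modified at or after $from, oldest first.
function forum_get_search_iterator($from = 0) {
  global $DB;

  $sql = "SELECT id, modified FROM {forum_posts} WHERE modified >= ? ORDER BY modified ASC";

  // The caller is responsible for closing the returned recordset.
  return $DB->get_recordset_sql($sql, array($from));
}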
</pre>

mod_get_search_documents($id)

The function returns an array of documents for the given ID. For forum, it would be one document for the forum post and zero or more for post attachments. Each document is a simple object of the gs_document class. The following attributes can be set:

$type Document type, can be set to GS_TYPE_TEXT (plain text in $content), GS_TYPE_HTML (HTML in $content), GS_TYPE_FILE (external file)
$contextlink Link to a context of the document (e.g. forum post)
$directlink Direct link to the document (e.g. document attached to a post). By default $directlink and $contextlink are the same (you can set any of them)
$title Title of the document
$content The main content. For external files, the field is ignored.
$user Author's user object.
$module Module that has created original document.
$id Id that identifies a set of documents for a given module.
$created Timestamp when the document was created.
$modified Timestamp of the last modification.
$courseid The course ID where the document is located.
$filepath Path to the external file with the document. Set when type is GS_TYPE_FILE.
$mime Mime type of the external file. Set when type is GS_TYPE_FILE.

You can see a working implementation in forum_gs_get_documents.
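
For readers without access to that implementation, here is a hedged sketch of what building a document set could look like. The gs_document class below only mirrors the attribute list above, the GS_TYPE_* values are placeholders, and `gs_sketch_forum_documents` with its `$post` and `$attachments` inputs is a hypothetical stand-in that takes its data as arguments instead of loading it from the database.

```php
<?php
// Assumption: constant values are placeholders chosen for this sketch.
define('GS_TYPE_TEXT', 1);
define('GS_TYPE_HTML', 2);
define('GS_TYPE_FILE', 3);

// Plain value object carrying the attributes listed above.
class gs_document {
    public $type = GS_TYPE_TEXT;
    public $contextlink;
    public $directlink;
    public $title;
    public $content;
    public $user;
    public $module;
    public $id;
    public $created;
    public $modified;
    public $courseid;
    public $filepath;
    public $mime;
}

// Hypothetical builder: one HTML document for the post, one file document
// per attachment, all sharing the same document set id (the post id).
function gs_sketch_forum_documents(stdClass $post, array $attachments): array {
    $link = '/mod/forum/discuss.php?d=' . $post->discussion . '#p' . $post->id;

    $main = new gs_document();
    $main->type = GS_TYPE_HTML;
    $main->title = $post->subject;
    $main->content = $post->message;
    $main->contextlink = $main->directlink = $link;
    $documents = [$main];

    foreach ($attachments as $file) {
        $doc = new gs_document();
        $doc->type = GS_TYPE_FILE;
        $doc->title = $file['name'];
        $doc->contextlink = $link;        // The post the file belongs to.
        $doc->directlink = $file['url'];  // The file itself.
        $doc->filepath = $file['path'];
        $doc->mime = $file['mime'];
        $documents[] = $doc;
    }

    // Fields shared by every document in the set.
    foreach ($documents as $doc) {
        $doc->module = 'forum';
        $doc->id = $post->id;
        $doc->created = $post->created;
        $doc->modified = $post->modified;
        $doc->courseid = $post->courseid;
    }
    return $documents;
}
```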

mod_search_access($id)

Function to check if current $USER has access to the document set $id. Return value should be one of:

GS_ACCESS_DENIED - current user can not access document set $id
GS_ACCESS_GRANTED - access granted
GS_ACCESS_DELETED - the document set with this $id does not exist
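
A sketch of how core search might consume these three return values. The numeric values are placeholders, `gs_sketch_apply_access` is a hypothetical name, and treating GS_ACCESS_DELETED as a cue to purge the index entry is an assumption about the intent, not something the proposal states.

```php
<?php
// Assumption: the numeric values are placeholders picked for this sketch.
define('GS_ACCESS_DENIED', 0);
define('GS_ACCESS_GRANTED', 1);
define('GS_ACCESS_DELETED', 2);

// Splits search hits by the module's verdict. $access stands in for the
// module's mod_search_access($id) function.
function gs_sketch_apply_access(array $ids, callable $access): array {
    $result = ['visible' => [], 'stale' => []];
    foreach ($ids as $id) {
        switch ($access($id)) {
            case GS_ACCESS_GRANTED:
                $result['visible'][] = $id;
                break;
            case GS_ACCESS_DELETED:
                // The set no longer exists: a candidate for index cleanup.
                $result['stale'][] = $id;
                break;
            // GS_ACCESS_DENIED or anything unexpected: hide the result.
        }
    }
    return $result;
}
```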

Edge cases

Backup restore

When a backup is restored, it may create a course that contains data (document sets), e.g. users' posts. When restored, timemodified may be in the past, which means that the time-based iterator will not "pick up" those entries. One solution is to implement logic in the core search to watch for "missing" IDs and query for them in the hope of retrieving valid documents. This will work if modules use numeric, incrementing IDs.
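
The "missing IDs" idea can be sketched as a gap finder over the ids already present in the index. `gs_missing_ids` is a hypothetical helper, and it assumes the module can report its current highest id; building the full range in memory is fine for a sketch but would need chunking on a large site.

```php
<?php
// Hypothetical helper: lists ids in 1..$maxid that are absent from the index
// for one module. Each gap would then be passed to the module's
// mod_get_search_documents($id) to see whether a valid document set exists.
function gs_missing_ids(array $indexedids, int $maxid): array {
    if ($maxid < 1) {
        return [];
    }
    return array_values(array_diff(range(1, $maxid), $indexedids));
}
```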

Performance

A few identified performance killers or bottlenecks.

Access check

Problem

Each document in the search result needs to be checked to see if it is accessible to the currently logged-in user. As only a module will know for sure if a user has access to a result, this may generate an enormous number of calls to mod_search_access($id), which may be very expensive. Notice that the maximum possible number of calls to mod_search_access($id) per search is not the number of results per page but the number of all matched documents. This is the worst-case scenario, when the user has access to fewer than results_per_page of all the matched results.

Possible solutions

It may be possible to quickly filter out at least some of the search results based on the data in the search index. For example, if the user is not a participant in $courseid, the search index can assume that he/she will not have access to any results from that course. In this case a list of the user's course IDs would need to be passed from the search code into the search engine.

Search API could be extended to allow modules to declare "simplified access". This would be a way for a module to say: "look search engine, the access rules for my content are very simple, you can make the access decision by yourself".
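
The course-based pre-filter can be sketched as a two-stage check: a cheap rejection using the courseid stored in the index, followed by the expensive module call only for the survivors, stopping once a page is full. `gs_filter_page` and the shape of `$hits` are assumptions for illustration.

```php
<?php
// Hypothetical two-stage access filter. Each hit carries the courseid stored
// in the index; $access stands in for the module's mod_search_access($id)
// and returns true when access is granted.
function gs_filter_page(array $hits, array $usercourses, callable $access, int $perpage): array {
    $page = [];
    $allowed = array_flip($usercourses);
    foreach ($hits as $hit) {
        if (!isset($allowed[$hit['courseid']])) {
            continue; // Cheap rejection using index data only.
        }
        if ($access($hit['id'])) {
            $page[] = $hit['id'];
            if (count($page) >= $perpage) {
                break; // Page is full: stop calling into modules.
            }
        }
    }
    return $page;
}
```

In the worst case described above (the user can see fewer than a page of results) the loop still visits every hit, but the pre-filter keeps most of those visits from reaching the module.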

Score calculation

Problem

Usually the most costly operation for a search engine is calculating the relevancy (score) of each search result. If a search matches a lot of results, the time spent calculating relevancy for each document may be substantial.

Possible solutions

Ideally search engine would work like this:

find matching documents
filter out documents that are not accessible for the current user
calculate relevancy

Basically, the documents that the user cannot access anyway should be filtered out as early as possible.
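
The ordering of those three steps can be sketched as follows. `gs_search_pipeline` is a hypothetical name and the three callables stand in for engine internals; the point is only that the scoring callable never sees a document the user cannot access.

```php
<?php
// Hypothetical pipeline. $docs maps a document id to arbitrary document data;
// $matches and $accessible return booleans, $score returns a number.
function gs_search_pipeline(array $docs, callable $matches, callable $accessible, callable $score): array {
    $candidates = array_filter($docs, $matches);        // 1. find matching documents
    $visible = array_filter($candidates, $accessible);  // 2. drop inaccessible ones
    $scored = array_map($score, $visible);              // 3. score only what is left
    arsort($scored);                                    // Highest score first, keys kept.
    return array_keys($scored);
}
```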

Testing

A list of test cases.

New content

index
add new forum post
re-index

make sure new content is searchable

Updated content

index
edit existing forum post
re-index

make sure new content is searchable
make sure old content is not searchable

Deleted content

index
delete existing forum post
re-index

make sure deleted content is not searchable

Backup restored

index
restore whole course from a backup
re-index

make sure restored content is searchable

File attachments 1

Create a forum post with 3 indexable (supported by Apache Tika) documents
Index
Confirm the content of the documents is searchable

File attachments 2

Create a forum post with 3 non-indexable (not supported by Apache Tika) documents, eg executable binary files
Index
Confirm binary content didn't break anything, and review the content of the solr storage

Ideas

Maybe you want modules to have a say in how an item shows up in the search result? Also, you may want to think about how global search fits in with the often-requested anonymity features (e.g. anonymous forum posts): if a non-privileged user searches for a post's author, their "anonymous" forum post should not show up.
Not only modules could be searchable. Extra search modules could be implemented as plugins, in a similar way other Moodle plugins are handled. They could allow for doing searches on non-module related data: e.g. courses, users.
