Search engine adapters

Update 2011.10.23: for Moodle 2.3 see Global Search.

The Global Search Engine of Moodle, available as an experimental feature up to the Moodle 1.9 release, lets plugins register new document types to be searched and indexed by the Lucene indexer.

Each module or block should have an adapter that wraps the plugin's internal data model into searchable documents. The current implementation lets a module provide the search engine with a set of virtual documents. The search engine indexes the text content of these documents and records enough data to get back to the exact context in which the content appeared (e.g. access URL, course context).

Virtual documents are defined as subclasses of the SearchDocument class. Only the constructor of the subclass must be written in order to map an input record to an internal document definition.

The goals of the adapter are :

  • extract all virtual documents from the module data model and hand them to the indexer through an iterator (initial index construction)
  • extract a single document for an index update
  • provide sufficient information to delete obsolete content from the index
  • define the access URL that leads back to the resource
  • define the access check algorithm so that users only reach resources they are allowed to see

Adapters For Modules And Blocks

Any adapter for a module or a block must reside in the
/search/documents
subdirectory, and is named after the pattern
<module>_document.php

Search defines

Each searchable module should add at least the following two (typical) defines in /search/lib.php :

define('SEARCH_TYPE_<MODULE>', '<module>');
define('PATH_FOR_SEARCH_TYPE_<MODULE>', 'mod/<module>');

or

define('SEARCH_TYPE_<BLOCK>', '<block>');
define('PATH_FOR_SEARCH_TYPE_<BLOCK>', 'blocks/<block>');

Constructor

The constructor of the SearchDocument class has the following signature :

public function __construct(&$doc, &$data, $course_id, $group_id, $user_id, $path)

where :

  • &$doc is a reference to a PHP object that should provide the fields :
    • docid : the id of the document, as suitable for reconstructing the access URL.
    • documenttype : in general, the name of the module itself
    • itemtype : a subclassifier, if the module provides more than one virtual document to the search engine.
    • contextid : the context object id that should be considered when checking accessor's capabilities
    • title : the title string to appear in search results as a caption
    • author : if the author is known, the user id (mdl_user.id) representing the author.
    • contents : the bulk text of the document content, stripped of any formatting attributes or tags
    • url : the document url, that will be constructed by the adapter to access back the resource
    • date : usually the date when the resource was created
  • &$data is a reference to a contextual metadata object that will be serialized along with the record, but will not be used as searchable content
  • $course_id is the id of the course the resource is in
  • $group_id is the id of the group the resource belongs to, if the resource is in a group scope (e.g. separate group wiki attachments), 0 otherwise.
  • $user_id is the id of the user the resource belongs to, if the resource is in a user-specific scope (e.g. post or assignment attachments), 0 otherwise.
  • $path is one of the above PATH defines for the module.
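
As an illustration of the parameter list above, here is a minimal sketch of a subclass constructor. The "notebook" module, its entry fields and the reduced SearchDocument base class are assumptions made so that the sketch runs on its own; the real base class does more than store its arguments.

```php
<?php
// Reduced stand-in for the SearchDocument base class: it only stores its
// arguments. Assumption made so the sketch runs outside Moodle.
class SearchDocument {
    public $doc;
    public $data;
    public $course_id;
    public $group_id;
    public $user_id;
    public $path;

    public function __construct(&$doc, &$data, $course_id, $group_id, $user_id, $path) {
        $this->doc = $doc;
        $this->data = $data;
        $this->course_id = $course_id;
        $this->group_id = $group_id;
        $this->user_id = $user_id;
        $this->path = $path;
    }
}

// Defines for a hypothetical "notebook" module, following the pattern above.
define('SEARCH_TYPE_NOTEBOOK', 'notebook');
define('PATH_FOR_SEARCH_TYPE_NOTEBOOK', 'mod/notebook');

class NotebookEntrySearchDocument extends SearchDocument {
    // $entry is a hash holding one (hypothetical) notebook entry record.
    public function __construct(array $entry, $notebook_id, $course_id, $context_id) {
        // Searchable part of the document.
        $doc = new stdClass();
        $doc->docid        = $entry['id'];
        $doc->documenttype = SEARCH_TYPE_NOTEBOOK;
        $doc->itemtype     = 'entry';
        $doc->contextid    = $context_id;
        $doc->title        = strip_tags($entry['title']);
        $doc->author       = $entry['userid'];
        $doc->contents     = strip_tags($entry['text']);
        // Relative path only; a real adapter would prefix $CFG->wwwroot.
        $doc->url          = '/mod/notebook/view.php?entry=' . $entry['id'];
        $doc->date         = $entry['timecreated'];

        // Non-searchable metadata, serialized along with the record.
        $data = new stdClass();
        $data->notebook = $notebook_id;

        // Entries here are neither group- nor user-scoped, hence 0 for both ids.
        parent::__construct($doc, $data, $course_id, 0, 0, PATH_FOR_SEARCH_TYPE_NOTEBOOK);
    }
}
```

The subclass takes a hash rather than an object as its first argument, which matches the way the single document callback further down feeds constructors through get_object_vars().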

Providing The Indexer Documents From The Module

When first constructing the index, the indexer needs to scan all instances of the plugin.

The adapter API must provide the

<module>_iterator(){ ... }

function, which returns a consistent set of plugin instances. Here is a standard template for this function :

function <module>_iterator() {
    $<module> = get_records('<module>');
    return $<module>;
} //<module>_iterator

On each instance, the function :

function <module>_get_content_for_index(&$plugininstance) { ... }

is called for constructing relevant instances of the SearchDocument subclass. This function MUST return an array of SearchDocuments or false. The typical synopsis of this function is :

function <module>_get_content_for_index(&$plugininstance){
    $documents = array();

    // invalid plugin
    if (!$plugininstance) return $documents;

    // TODO : get an indexable item set

    foreach($indexableitems as $indexableitem) {

       // TODO : Prepare params with 

       $documents[] = new <Module>SearchDocument(... params ...);
    } 
    return $documents;
}
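
To make the two templates above more concrete, the following sketch chains an iterator and a content callback for a hypothetical "notebook" module. The in-memory tables, the reduced document class and the collecting loop are all assumptions for illustration; a real adapter would read the database with get_records() and the indexer's own loop may differ.

```php
<?php
// Minimal stand-in for a SearchDocument subclass (assumption: the real class
// extends SearchDocument and fills a complete $doc object).
class NotebookEntrySearchDocument {
    public $docid;
    public $title;
    public $course_id;

    public function __construct(array $entry, $notebook_id, $course_id, $context_id) {
        $this->docid = $entry['id'];
        $this->title = $entry['title'];
        $this->course_id = $course_id;
    }
}

// In-memory replacement for the database, so no get_records() call is needed.
function notebook_fake_table($table) {
    $tables = array(
        'notebook' => array(
            1 => (object) array('id' => 1, 'course' => 5, 'contextid' => 40),
            2 => (object) array('id' => 2, 'course' => 6, 'contextid' => 41),
        ),
        'notebook_entries' => array(
            10 => (object) array('id' => 10, 'notebook' => 1, 'title' => 'First'),
            11 => (object) array('id' => 11, 'notebook' => 1, 'title' => 'Second'),
            12 => (object) array('id' => 12, 'notebook' => 2, 'title' => 'Third'),
        ),
    );
    return $tables[$table];
}

// Returns every instance of the plugin.
function notebook_iterator() {
    return notebook_fake_table('notebook');
}

// Builds one document per entry of the given instance.
function notebook_get_content_for_index(&$notebook) {
    $documents = array();
    if (!$notebook) {
        return $documents;
    }
    foreach (notebook_fake_table('notebook_entries') as $entry) {
        if ($entry->notebook != $notebook->id) {
            continue;
        }
        $documents[] = new NotebookEntrySearchDocument(
            get_object_vars($entry), $notebook->id, $notebook->course, $notebook->contextid);
    }
    return $documents;
}

// How an indexer could chain the two callbacks (illustrative loop only).
function sketch_collect_documents() {
    $all = array();
    foreach (notebook_iterator() as $notebook) {
        $documents = notebook_get_content_for_index($notebook);
        if ($documents === false) {
            continue; // nothing indexable for this instance
        }
        $all = array_merge($all, $documents);
    }
    return $all;
}
```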

Making The Backaccess Link

The constructor of the SearchDocument subclass must build a backaccess link for the document and store it as the 'url' attribute of the first constructor parameter (&$doc). This is usually done through a callback in the document API. The synopsis is :

function <module>_make_link(...contextual params...) {
    global $CFG;
    
    return $CFG->wwwroot.<moodle path expression that drives back to the content>;
} //<module>_make_link

Contextual params are usually the id of the course module, ids of internal entities (depending on how the module is built), modal values, etc.
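
A filled-in version of this synopsis might look like the sketch below. The "notebook" module, its view.php parameters and the wwwroot value are hypothetical; only the use of $CFG->wwwroot comes from the synopsis.

```php
<?php
// Stand-in for Moodle's global configuration object (assumed value).
$cfg = new stdClass();
$cfg->wwwroot = 'http://moodle.example.invalid';
$GLOBALS['CFG'] = $cfg;

// Builds the URL leading back to one entry of a hypothetical notebook module.
// $cm_id is the course module id, $entry_id the id of the indexed entry.
function notebook_make_link($cm_id, $entry_id) {
    global $CFG;

    // Cast to int so that only numeric ids end up in the query string.
    return $CFG->wwwroot . '/mod/notebook/view.php?id=' . (int) $cm_id
        . '&entry=' . (int) $entry_id;
} //notebook_make_link
```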

Updating The Index Database

Once fed with its first from-scratch indexing results, the search engine is regularly updated by a cron job. The index is updated in diff mode, so only entries modified in the Moodle data model are considered.

The update process will :

  • add new items, as far as they are handled by the search engine
  • update modified items
  • drop indexes on deleted resources and contents

An API callback tells each of these three actions where the items to consider are in the Moodle database. This callback SHOULD return an array of arrays of strings, each containing the following fieldset :

  • primary id fieldname,
  • table name,
  • time created field name,
  • time modified field name,
  • itemtype,
  • [additional SELECT clause for filtering rows] // optional

Here is a synopsis of such code. There should be as many arrays as there are known document subtypes in the module.

function <module>_db_names() {
    return array(
        array('id', '<module>_<entity>', 'created', 'modified', '<itemtype>', '')
    );
} //<module>_db_names

Note : the 'created' and 'modified' fields go by several names across modules and blocks. Sometimes this information is not available at all; in that case, consider using the dates of a parent dependency.
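
The sketch below shows a callback with two document subtypes and one way a fieldset could be turned into a query for rows modified since the last run. The table and field names belong to a hypothetical "notebook" module, and sketch_update_query() is not part of the search engine: the SQL actually issued by the update scripts is not shown here, so treat this purely as an illustration of how the six fields fit together.

```php
<?php
// One fieldset per document subtype of a hypothetical notebook module.
function notebook_db_names() {
    return array(
        array('id', 'notebook_entries', 'timecreated', 'timemodified', 'entry', ''),
        array('id', 'notebook_comments', 'timecreated', 'timemodified', 'comment', 'approved = 1'),
    );
} //notebook_db_names

// Hypothetical helper: builds a query selecting the rows of one fieldset
// that were modified after $since (a Unix timestamp).
function sketch_update_query(array $fieldset, $prefix, $since) {
    // The trailing SELECT clause is optional, so pad missing values with ''.
    list($id, $table, $created, $modified, $itemtype, $where) = array_pad($fieldset, 6, '');

    $sql = "SELECT {$id}, {$modified} FROM {$prefix}{$table}"
         . " WHERE {$modified} > " . (int) $since;
    if ($where !== '') {
        $sql .= " AND ({$where})";
    }
    return $sql;
}
```

A query for new items would follow the same shape with the time created field in place of the time modified one.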

Knowing where the items to add, update or delete are, the first two operations (add and update) use a "Single Document Wrapper" to process each item individually. This is the purpose of the

function <module>_single_document($id, $itemtype) { ... }

function. Here is a rough prototype for such a callback :

function <module>_single_document($id, $itemtype) {

    switch($itemtype){
        case <type1>:
           ... get content holding record ...
           ... get module_obj from previous ...
        break;
        case <type2>:
           ... get content holding record ...
           ... get module_obj from previous ...
        break;
        ... and so on ...
    }

    $coursemodule = get_field('modules', 'id', 'name', '<module>');
    $cm = get_record('course_modules', 'course', $<module_obj>->course, 
                     'module', $coursemodule, 'instance', $<module_obj>->id);
    if ($cm){
        $context = get_context_instance(CONTEXT_MODULE, $cm->id);
        ... prepare some data if needed ...
        return new <ModuleItem>SearchDocument(get_object_vars($<content_obj>), 
               $<module_obj>->id, $<module_obj>->course, $itemtype, $context->id);
    }
    return null;
} // <module>_single_document

Note : get_object_vars() is needed because the historical implementation uses a hash in constructors rather than an object.
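
As a worked instance of the prototype above, the following sketch resolves two item types of a hypothetical "notebook" module. The in-memory records replace get_record(), and the course module and context lookups are collapsed into a contextid field on the fake module record; both shortcuts are assumptions made so that the sketch runs outside Moodle.

```php
<?php
// Reduced stand-in for the SearchDocument subclass (assumption).
class NotebookSearchDocument {
    public $docid;
    public $itemtype;
    public $course_id;
    public $context_id;

    public function __construct(array $record, $notebook_id, $course_id, $itemtype, $context_id) {
        $this->docid = $record['id'];
        $this->itemtype = $itemtype;
        $this->course_id = $course_id;
        $this->context_id = $context_id;
    }
}

// In-memory replacement for get_record(): returns an object or false.
function notebook_fake_record($table, $id) {
    $tables = array(
        'notebook' => array(1 => array('id' => 1, 'course' => 5, 'contextid' => 40)),
        'notebook_entries' => array(10 => array('id' => 10, 'notebook' => 1, 'title' => 'First')),
        'notebook_comments' => array(20 => array('id' => 20, 'entry' => 10, 'text' => 'Nice')),
    );
    return isset($tables[$table][$id]) ? (object) $tables[$table][$id] : false;
}

function notebook_single_document($id, $itemtype) {
    switch ($itemtype) {
        case 'entry':
            // The entry itself holds the content and points to its notebook.
            $content = notebook_fake_record('notebook_entries', $id);
            $notebook = $content ? notebook_fake_record('notebook', $content->notebook) : false;
            break;
        case 'comment':
            // A comment reaches its notebook through the parent entry.
            $content = notebook_fake_record('notebook_comments', $id);
            $entry = $content ? notebook_fake_record('notebook_entries', $content->entry) : false;
            $notebook = $entry ? notebook_fake_record('notebook', $entry->notebook) : false;
            break;
        default:
            return null; // unknown item type
    }

    if (!$content || !$notebook) {
        return null; // record vanished between detection and indexing
    }

    // Constructors expect a hash, hence get_object_vars().
    return new NotebookSearchDocument(get_object_vars($content),
           $notebook->id, $notebook->course, $itemtype, $notebook->contextid);
} // notebook_single_document
```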

For deletion, you will have to reproduce a callback left over from an old, undocumented implementation :

function <module>_delete($info, $itemtype) {
    $object->id = $info;
    $object->itemtype = $itemtype;
    return $object;
} // <module>_delete

Just rewrite it AS IS, nothing to think about here.

Checking Access Back To The Content

When searching, the Zend Lucene engine digs into its own word index and retrieves a set of matching "Documents". You will then want to filter out the documents that the user who ran the search DOES NOT have permission to see.

The search results function calls back an access checker for each result, depending on the known document type. This is the

function <module>_check_text_access($path, $itemtype, $this_id, $user, $group_id, $context_id){ ... }

function, which has to implement the access policy according to the plugin's contextual logic. The function traps any access revocation condition and then returns false. If no trap is triggered, it returns true. The admin role bypasses this check and can access all documents.

Here is a standard synopsis for that implementation :

function <module>_check_text_access($path, $itemtype, $this_id, $user, $group_id, $context_id){
    global $CFG, $USER;

    // may include some plugins libs
    include_once("{$CFG->dirroot}/{$path}/lib.php");

    ... get the course module and content record ...

    // this part is very standard for modules, reconsider it for blocks
    // or other stuff 
    $context = get_record('context', 'id', $context_id);
    $cm = get_record('course_modules', 'id', $context->instanceid);
    if (!$cm->visible and !has_capability('moodle/course:viewhiddenactivities', $context)){
        // useful for debugging unexpected visibility or non-visibility
        if (!empty($CFG->search_access_debug)) 
                   echo "search reject : <reject cause> ";
        return false;
    }

    // user check : entries should not be owner private or we are that user    

    // group check : entries should be in accessible groups

    // date checks : we are in an open window
    
    return true;
} //<module>_check_text_access
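
The three commented checks at the end of the synopsis (user, group, date) could be filled in along the lines of the sketch below. It works on plain objects instead of database records, and every field name (visible, private, userid, groupid, timeopen, timeclose, can_view_hidden, groupids) is an assumption chosen for the example; a real module would map these onto its own tables and capabilities.

```php
<?php
// Hypothetical access policy for one indexed entry.
// $entry  : the content record the search hit points to
// $viewer : the user running the search, with the ids of their groups
// $now    : current Unix timestamp, passed in so the function stays testable
function sketch_entry_is_accessible($entry, $viewer, $now) {
    // visibility check : hidden activities need a dedicated capability
    if (!$entry->visible && !$viewer->can_view_hidden) {
        return false;
    }

    // user check : entries should not be owner private or we are that user
    if ($entry->private && $entry->userid != $viewer->id) {
        return false;
    }

    // group check : group id 0 means the entry is not group scoped
    if ($entry->groupid != 0 && !in_array($entry->groupid, $viewer->groupids)) {
        return false;
    }

    // date checks : a bound of 0 means the window is open on that side
    if ($entry->timeopen != 0 && $now < $entry->timeopen) {
        return false;
    }
    if ($entry->timeclose != 0 && $now > $entry->timeclose) {
        return false;
    }

    // no trap was triggered
    return true;
}
```

Each trap returns false on its own, so the order of the checks only affects which reject cause would be reported when debugging.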

Note : Debugging access policy

Use the construction above wherever you return false :

        if (!empty($CFG->search_access_debug)) {
           echo "search reject : <reject cause> ";
        }
        return false;

For debugging, add a simple key in {$CFG->prefix}_config table :

INSERT INTO {$CFG->prefix}_config (name, value) VALUES ('search_access_debug', 1);

Delete it for normal operations :

DELETE FROM {$CFG->prefix}_config WHERE name = 'search_access_debug';

Physical Document Adapters