Search API

Revision as of 16:35, 28 November 2016 by David Monllaó (talk | contribs) (Automatic context-based filtering)

Moodle 3.1


Overview

The search API allows you to index contents in a search engine and query the search engine for results. Any Moodle component (all plugin types and all core subsystems) can define search areas for their contents.

This is different from the internal search feature some components have.

Add a search area

Note that I will be writing plugintype and pluginname to make it easier to understand for 3rd party developers not working in core, but it is also applicable for core subsystems, using core as plugintype and componentname as pluginname. I will be using areaname as the area you are defining.

Easy case: Activity information

If you just want to index your activity's basic information, such as its name and description, you can skip the documentation below. Extend \core_search\base_activity (\core_search\area\base_activity in Moodle 3.1): copy mod/book/classes/search/activity.php and replace 'book' with your activity name. If your activity uses fields other than name and intro (the de facto standard), or you want to index extra fields or files, look at mod/page/classes/search/activity.php (extra fields) or mod/assign/classes/search/activity.php (files).

To index other information your activity tables contain (e.g. glossary or journal entries, assignment submissions...) you need to extend \core_search\base_mod instead. Continue reading below please.

Base class

All the search area code is contained in a single \plugintype_pluginname\search\areaname class, which should extend one of the following classes:

  • \core_search\base (\core_search\area\base in Moodle 3.1): Generic base class for a search area
  • \core_search\base_mod (\core_search\area\base_mod in Moodle 3.1): Base class for activity module search areas
  • \core_search\base_activity (\core_search\area\base_activity in Moodle 3.1): Base class for activities too, but more specific than base_mod: it is intended for indexing an activity's basic data, such as its name and description. To index other activity-specific data, use base_mod. If in doubt, think of forums: the forum activities themselves should use base_activity, but forum posts should use base_mod.

Name

You need to set a visible name for the search area. For plugins, define it in the plugin's language strings file; for core subsystems, in lang/en/search.php. The string identifier for the name shown to Moodle users (in the filters section of the search page and on the admin search pages) is search:AREANAME. For example, forum posts use $string['search:posts'] = 'Forum - posts' in mod/forum/lang/en/forum.php.
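As a minimal sketch of the naming rule above, this is what the language string could look like for a third-party activity. The plugin name "myplugin", the area name "entries" and the display text are all hypothetical, chosen only for illustration.

```php
<?php
// mod/myplugin/lang/en/myplugin.php (hypothetical plugin "myplugin").

// Visible name of the hypothetical "entries" search area, which would be
// implemented by the class \mod_myplugin\search\entries.
$string['search:entries'] = 'My plugin - entries';
```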

Index data

get_recordset_by_timestamp should return a recordset with every record in your search area that has been modified since the $modifiedfrom timestamp (integer).

public function get_recordset_by_timestamp($modifiedfrom = 0) {
    global $DB;
 
    // The idea is to include here most (if not all) of the data you will need to index (see get_document below)
    $sql = "SELECT x.* FROM {xxxxx} x WHERE x.timemodified >= ? ORDER BY x.timemodified ASC";
 
    // Note that this is an example, you might have more params.
    return $DB->get_recordset_sql($sql, array($modifiedfrom));
}

get_document receives one of the records returned by the previous query ($record) and an array of options. It should return a \core_search\document object with all the data to be indexed.

The options this function can receive are:

  • indexfiles: Whether to index files or not. File indexing support also depends on the search engine backend, and not all of them support it. If file indexing is not supported there is no need to flag the document as new.
  • lastindexedtime: Also related to files: the time of the last indexing run, used in the example below to flag new documents

public function get_document($record, $options = array()) {
 
    // All wrapped in a try & catch as we should not stop the indexing process because of a legacy corrupted database.
    try {
        // context_course::instance() expects the course id, not a context id.
        $context = \context_course::instance($record->courseid);
    } catch (\dml_missing_record_exception $ex) {
        debugging('Error retrieving ' . $this->areaid . ' ' . $record->id . ' document, not all required data is available: ' .
            $ex->getMessage(), DEBUG_DEVELOPER);
        return false;
    } catch (\dml_exception $ex) {
        debugging('Error retrieving ' . $this->areaid . ' ' . $record->id . ' document: ' . $ex->getMessage(), DEBUG_DEVELOPER);
        return false;
    }
 
    // Prepare associative array with data from DB.
    $doc = \core_search\document_factory::instance($record->id, $this->componentname, $this->areaname);
 
    // Any content should be converted to plain text.
 
    // If you just have a plain text string, call content_to_text() with the $contentformat param set to false.
    // $record->name is a placeholder: use whichever field best describes the result.
    $doc->set('title', content_to_text($record->name, false));
 
    // For a property named 'content' where another property 'contentformat' is also present the text should be
    // passed through to content_to_text() when declaring the document. This will ensure that HTML, Markdown, ...
    // formats are converted to plain text. Similar to what we do with format_text.
    $doc->set('content', content_to_text($record->content, $record->contentformat));
 
    $doc->set('contextid', $context->id);
    $doc->set('type', \core_search\manager::TYPE_TEXT);
    $doc->set('courseid', $record->courseid);
    $doc->set('modified', $record->timemodified);
 
    // Optional fields.
 
    // The user that created the record. In some cases, like forum posts, it makes sense to have it available;
    // in others, like activities, it does not help much.
    $doc->set('userid', $record->userid);
 
    // If the indexed document should only be accessible to the user that created it,
    // replace the NO_OWNER_ID constant with the owner's user id.
    $doc->set('owneruserid', \core_search\manager::NO_OWNER_ID);
 
    // Extra contents associated to the document.
    $doc->set('description1', content_to_text($record->extracontent1, $record->extracontent1format));
    $doc->set('description2', content_to_text($record->extracontent2, $record->extracontent2format));
 
    // Not compulsory, but speeds things up when the search area includes files (see "Indexing files").
    if (isset($options['lastindexedtime']) && ($options['lastindexedtime'] < $record->created)) {
        // If the document was created after the last index time, it must be new.
        $doc->set_is_new(true);
    }
    return $doc;
}

Indexing files

First declare that the search area is interested in indexing the files attached to its documents.

public function uses_file_indexing() {
    return true;
}

Then define the attach_files function, which receives a \core_search\document object and fills it with stored_file objects.

This is a simplified example of one of the cases we can find in core.

public function attach_files($document) {
    $fs = get_file_storage();
 
    // The document stores a context id, so load the context by id.
    $context = \context::instance_by_id($document->get('contextid'));

    $files = $fs->get_area_files($context->id, 'COMPONENTNAME', 'FILEAREA', $document->get('itemid'));
    foreach ($files as $file) {
        $document->add_stored_file($file);
    }
}

Indexing performance

The indexing process runs incrementally from cron, keeping track of the last indexing time. On big Moodle sites this process can be very heavy, as millions of records may be indexed in the same PHP process, so performance is important. Bear in mind that any database query you add to get_document to retrieve data for your document will run once for every document. If your search area can potentially contain a large number of records, consider adding static caches.
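The static cache idea can be sketched as follows. This is only an illustration under stated assumptions: course_lookup_cache is a hypothetical helper, not part of Moodle, and the loader is a stand-in for whatever query your get_document needs (for example a $DB->get_record() call). The point is that each distinct id is loaded once per PHP process.

```php
<?php
// Hypothetical helper, not part of Moodle: memoises lookups for the lifetime
// of the PHP process so get_document() does not repeat the same query.
final class course_lookup_cache {
    /** @var array Loaded records keyed by course id. */
    private static $cache = array();

    /**
     * Returns the record for $courseid, calling $loader only on a cache miss.
     *
     * @param int $courseid
     * @param callable $loader Stand-in for the real query.
     * @return mixed Whatever $loader returned for this id.
     */
    public static function get($courseid, callable $loader) {
        if (!array_key_exists($courseid, self::$cache)) {
            self::$cache[$courseid] = $loader($courseid);
        }
        return self::$cache[$courseid];
    }
}

// Stand-in loader that counts how often the "database" is hit.
$queries = 0;
$loader = function ($id) use (&$queries) {
    $queries++;
    return (object) array('id' => $id, 'fullname' => 'Course ' . $id);
};

// Four documents, but only two distinct courses: the loader runs twice.
foreach (array(5, 5, 7, 5) as $courseid) {
    $course = course_lookup_cache::get($courseid, $loader);
}
```

A static cache like this grows with the number of distinct ids, so on very large sites you may want to cap its size or cache only the fields you actually need.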

Access control

Moodle's global search limits the access to the indexed data in two different ways.

Automatic context-based filtering

This is done automatically by Moodle: the user performing a query only gets results from contexts they have access to. If your search area indexes contents that belong to a course, users only see results from courses they can access. If your search area belongs to a course module, they only see results from visible activities in courses they can access, where no completion rules prevent access. Similarly, if your search area belongs to a user, only that user and site admins have access to its contents. This last option is closely related to the document's 'owneruserid' field; the main difference is that setting 'owneruserid' to the user id also makes the documents unavailable to admin users.

You need to specify which context level (or levels) your search area contents belong to by overriding the static $levels attribute in your class.

protected static $levels = [CONTEXT_COURSE];

Final access checking

Your search area is responsible for filtering out the results that the current user should not be able to access.

public function check_access($id) {
    try {
        $myobject = $this->get_xxxxx($id);
    } catch (\dml_missing_record_exception $ex) {
        // If the record does not exist anymore in Moodle we should return \core_search\manager::ACCESS_DELETED.
        return \core_search\manager::ACCESS_DELETED;
    } catch (\dml_exception $ex) {
        // Skip results if there is any unexpected error.
        return \core_search\manager::ACCESS_DENIED;
    }
 
    // Database fields are not PHP booleans, so use a loose check rather than === false.
    if (!$myobject->visible) {
        return \core_search\manager::ACCESS_DENIED;
    }
 
    return \core_search\manager::ACCESS_GRANTED;
}

The automatic context-based filtering should get rid of most of the results a user does not have access to. If check_access ends up rejecting a large share of the results, you might need to rethink the context your search area belongs to, or the APIs might need to be expanded.

Other required methods

The link to the result. Should return a \moodle_url object.

public function get_doc_url(\core_search\document $doc) {
    // This is just an example; it can vary a lot depending on what you are indexing.
    return new \moodle_url('link/to/your/component.php', array('id' => $doc->get('id')));
}

The link to the result's context; in some cases it might be the same as the doc URL. Should return a \moodle_url object.

public function get_context_url(\core_search\document $doc) {
    // This is just an example; it can vary a lot depending on what you are indexing.
    return new \moodle_url('link/to/your/component/context.php', array('id' => $doc->get('id')));
}