Search engines

Revision as of 14:04, 12 May 2016 by Eric Merrill (talk | contribs) (Update information about file indexing.)

Introduction

Search engines index large amounts of data in a structured way that allows users to query them and extract relevant data. There are many search engines with nice APIs for storing and retrieving data. We made Moodle's global search pluggable so that different backends can be used, from a simple database table (fine for small sites but unusable for big ones) to open source systems like Solr or Elasticsearch (both built on top of Apache Lucene) or proprietary cloud-based systems.

Terms

Index: You know what it means, but on this page we use index to mean the data container in your search engine. It can be an instance in your search engine server, or a database table name if you are writing a search engine for mongodb (just an example).

Document: A "searchable" unit of data, like a database entry or a forum post. You can see it as one of the search results you might expect to get back from a search engine.

Writing your own search engine plugin

To write your own search engine you need to code methods to set, retrieve and delete data from your search engine. You will need to add a \search_yourplugin\engine class in search/engine/yourplugin/classes/engine.php extending \core_search\engine.

Search engine setup

Your search engine needs to be prepared to index data. You can provide a script so your plugin's users can easily create the required structure in the search engine; otherwise, add instructions on how to do it.

You can get the list of fields Moodle needs, along with other information you might need such as the field types, by calling \core_search\document::get_default_fields_definition().

Add contents

This method is executed when Moodle contents are being indexed in the search engine. Moodle iterates through all search areas extracting which contents should be indexed and assigns them a unique id based on the search area.

public function add_document(array $doc, $fileindexing = false) {
    // Use curl or any other method or extension to push the document to your search engine.
}

$doc contains the document data with all required fields (and possibly some optional ones). Its contents are already validated, so an integer field will hold an integer value, and so on. $fileindexing is true if the search area that generated the document supports attached files; it is always false if your plugin does not support file indexing.

File indexing

If the engine supports file indexing and $fileindexing is passed as true to add_document (indicating the area supports file indexing), the document will have all of its associated files attached as stored_file instances. They can be retrieved from the document with get_files(). It is up to the engine to determine how they are indexed, including content extraction.

The files attached to a document for indexing represent the authoritative set of files for that document. This means the engine should ensure that when re-indexing a document, any files no longer attached to it are not in the index.
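One way to honour this rule is to compare the files already indexed for a document with the files currently attached, and remove the difference. The following is a minimal sketch in plain PHP. The function name plan_file_sync and the idea of tracking files by integer id are assumptions made for illustration; they are not part of the Moodle API.

```php
<?php
/**
 * Hypothetical helper: work out which files to add to and remove from the index.
 *
 * @param int[] $indexedids File ids currently stored in the engine for this document (assumed).
 * @param int[] $attachedids File ids attached to the document being re-indexed (assumed).
 * @return array ['add' => int[], 'remove' => int[]]
 */
function plan_file_sync(array $indexedids, array $attachedids) {
    return [
        // Attached to the document but not yet in the index.
        'add' => array_values(array_diff($attachedids, $indexedids)),
        // Indexed earlier but no longer attached: these must leave the index.
        'remove' => array_values(array_diff($indexedids, $attachedids)),
    ];
}

// Files 10, 11, 12 were indexed earlier; the document now has 11, 12, 13 attached.
$plan = plan_file_sync([10, 11, 12], [11, 12, 13]);
```

Here $plan['remove'] lists the stale files the engine should delete and $plan['add'] the new ones it should index.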

Retrieve contents

This is the key method, as search engine plugins have a lot of flexibility here.

You will receive the search filters the user specified and the list of contexts the user can access. This function should return an array of \core_search\document objects.

public function execute_query($filters, $usercontexts, $limit = 0) {
    // Prepare a query applying all filters.
    // Include $usercontexts as a filter on the contextid field.
    // Send a request to the server.
    // Iterate through the results.
    // Check user access, read https://docs.moodle.org/dev/Search_engines#Security for more info.
    // Convert results to \core_search\document objects using \core_search\document::set_data_from_engine.
    // Return an array of \core_search\document objects, limited to $limit or \core_search\manager::MAX_RESULTS if empty.
}

File indexing

When retrieving results based on a file hit, you can attach stored_file instances to the document to indicate which file(s) produced the match. This information is rendered as part of the results page. Because of rendering considerations, it should be limited to a reasonable number of 'match files' per document. search_solr limits this to a maximum of 3 matching files.

Security

It is crucial that this function checks the result of \core_search\document::check_access and does not return results the user does not have access to. Moodle already performs part of the required security checks, but search areas always have the last word and their decision must be respected.

Getting enough results

Because in some cases many records may fail check_access(), engines should make provisions to ensure that a full set of documents is returned, even if that means checking many more documents. See MDL-53758 for further discussion.
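A minimal sketch of one such provision follows: keep requesting batches of candidates until enough of them pass the access check or the engine runs out. It is plain PHP with no Moodle dependencies. The function collect_accessible and both callbacks are hypothetical; $canaccess stands in for whatever check_access() test your engine performs.

```php
<?php
/**
 * Hypothetical sketch: fetch candidate batches until $limit results pass the access check.
 *
 * @param callable $fetchbatch function(int $offset, int $size): array of raw candidates (assumed).
 * @param callable $canaccess function($candidate): bool, a stand-in for the check_access() test.
 * @param int $limit Number of accessible results wanted.
 * @param int $batchsize Candidates requested per round trip.
 * @return array Accessible results, at most $limit of them.
 */
function collect_accessible(callable $fetchbatch, callable $canaccess, $limit, $batchsize = 50) {
    $results = [];
    $offset = 0;
    while (count($results) < $limit) {
        $batch = $fetchbatch($offset, $batchsize);
        if (empty($batch)) {
            break; // The engine has no more candidates.
        }
        foreach ($batch as $candidate) {
            if ($canaccess($candidate)) {
                $results[] = $candidate;
                if (count($results) >= $limit) {
                    break; // Enough results, stop checking this batch.
                }
            }
        }
        $offset += count($batch);
    }
    return $results;
}

// Usage with an in-memory "index" of 20 candidates where only even ids are accessible.
$index = range(1, 20);
$found = collect_accessible(
    function ($offset, $size) use ($index) {
        return array_slice($index, $offset, $size);
    },
    function ($id) {
        return $id % 2 === 0;
    },
    5,
    4
);
```

In this run three batches of four candidates are needed to gather five accessible results.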

Record counts

get_query_total_count() must be implemented to return the number of results available for the most recent call to execute_query(). This is used to determine how many pages will be displayed in the paging bar. For more discussion see MDL-53758.

public function get_query_total_count() {
    // Return an approximate count of total records for the most recently completed execute_query().
}

The value can be an estimate. The search manager will ensure that if a page is requested that is beyond the last page of actual results, the user will seamlessly see the last available page.

There are a number of ways to determine what value to return from get_query_total_count(). Note that if the method you choose requires you to process more than $limit valid documents, you still must only return $limit records from execute_query(). Some of the ways to do this are:

Return how many possible results there are

This means the number of results we have processed and passed (using check_access()), plus any candidate results that are left. Alternatively, it is the total count of records for the query, minus the ones we have rejected so far. search_solr uses this method.

User experience: User sees a full complement of pages. If the pass-to-fail ratio on check_access() is very high (and we generally expect it to be), then the number of pages should generally be accurate. This method will always err on the high side. It is possible that when clicking on a higher page there will be no results available, in which case the search manager will seamlessly show the user the last page with actual results.

Pros: Free or very cheap with some engines - no need to check access/process records beyond the current page. Reasonable user experience.

Cons: Future page count is not perfect, so can result in page not being available when clicked (but the manager mitigates this).
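The arithmetic behind this approach can be sketched as follows. This is an illustration in plain PHP, not the search_solr implementation: the function name, its parameters and the cap at a maximum number of results are all assumptions.

```php
<?php
/**
 * Hypothetical helper: estimate the total as "engine hits minus rejected documents".
 *
 * @param int $enginetotal Total number of hits the engine reported for the query.
 * @param int $rejected Documents that failed the access check so far.
 * @param int $maxresults Upper bound on reported results (assumed cap).
 * @return int Estimated count, never negative and never above $maxresults.
 */
function estimate_total_count($enginetotal, $rejected, $maxresults) {
    $estimate = max(0, $enginetotal - $rejected);
    return min($estimate, $maxresults);
}

$estimate = estimate_total_count(57, 4, 100);  // 57 hits, 4 rejected so far.
$capped = estimate_total_count(500, 10, 100);  // More hits than may be shown.
```

The estimate only shrinks as more documents are rejected, which is why it errs on the high side.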

Return the current count plus 1

In this case, you would process records until you have $limit valid results plus one more. If the extra result exists, you would return that count ($limit plus one); otherwise you would return the actual count.

User experience: User would only see up to the current page, plus one more, except when on the last page of results. Gmail search works similarly: you can only navigate to the next page of results, not to an arbitrary page.

Pros: Relatively cheap. Reasonable user experience.

Cons: User can’t jump to an arbitrary page even if they know what page a particular result is on.
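A minimal sketch of the "plus one" bookkeeping is shown below, in plain PHP. The helper name and its return shape are assumptions for illustration. It expects the caller to have gathered up to $limit + 1 accessible results already.

```php
<?php
/**
 * Hypothetical helper for the "current count plus one" strategy.
 *
 * @param array $accessible Accessible results; the caller stops collecting at $limit + 1.
 * @param int $limit Maximum number of results execute_query() may return.
 * @return array [results to return, count to report from get_query_total_count()]
 */
function split_results_and_count(array $accessible, $limit) {
    // Report at most one more than the limit, which signals "there is a next page".
    $count = min(count($accessible), $limit + 1);
    // Never hand back more than $limit results.
    return [array_slice($accessible, 0, $limit), $count];
}

list($results, $count) = split_results_and_count(range(1, 11), 10);
```

With 11 accessible results and a limit of 10, ten results are returned and the reported count is 11; with fewer than 10 results the reported count is simply the number found.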

Calculate all results up to MAX_RESULTS

This would mean calculating the full set of results up to MAX_RESULTS, and returning the actual count of results.

User experience: The user will see the exact number of pages they should.

Pros: Cleanest user experience.

Cons: Very expensive, as you are calculating up to MAX_RESULTS results on every page, even if you are only showing the first page.

Just return MAX_RESULTS

User experience: User will always see 10 pages, except when they are on the last page of actual results.

Pros: Free

Cons: Worst user experience

Delete contents

public function delete($areaid = false) {
    if ($areaid === false) {
        // Delete all your search engine index contents.
    } else {
        // Delete all your search engine contents where areaid = $areaid.
    }
}

\core_search\document::check_access will return \core_search\manager::ACCESS_DELETED if a document returned from the search engine is no longer available in Moodle. You can use this to clean up the search engine contents with some kind of \search_yourplugin\engine::delete_by_id method. See the execute_query method in search/engine/solr/classes/engine.php for an example.

Other abstract methods you need to override

public function file_indexing_enabled() {
    // Defaults to false, override it if your search engine supports file indexing.
    return false;
}


public function is_server_ready() {
    // Check if your search engine is ready.
}

Other methods you might be interested in overriding

public function is_installed() {
    // Check if the required PHP extensions you need to make the search engine work are installed.
}


public function optimize() {
    // Optimize or defragment the index contents.
}


These methods are called while the indexing process is running and allow the search engine to hook into the indexing process.

public function index_starting($fullindex = false) {
    // Nothing by default.
}


public function index_complete($numdocs = 0, $fullindex = false) {
    // Nothing by default.
}


public function area_index_starting($searcharea, $fullindex = false) {
    // Nothing by default.
}


public function area_index_complete($searcharea, $numdocs = 0, $fullindex = false) {
    return true;
}

Adapting document formats to your search engine format

\core_search\document is the class that represents a document. Depending on your search engine backend's limitations or on how it stores time values, you might want to override this class in \search_yourplugin\document. The main functions you might want to override are:

Format date/time fields

public static function format_time_for_engine($timestamp) {
    // Convert $timestamp to a string using the format used by your search engine.
}

By default, \core_search\document::format_time_for_engine returns a timestamp (integer).

Import date/time contents from the search engine

public static function import_time_from_engine($time) {
    // Convert the string returned from the search engine as a date/time format to a timestamp (integer).
}

By default, \core_search\document::import_time_from_engine returns a timestamp (integer).
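As an illustration of the two conversions, the sketch below assumes an engine that stores dates as ISO 8601 strings in UTC. It uses standalone functions with hypothetical names rather than a real \search_yourplugin\document class, so it can be run without Moodle.

```php
<?php
/**
 * Hypothetical stand-in for format_time_for_engine(): timestamp to ISO 8601 UTC string.
 */
function example_format_time_for_engine($timestamp) {
    // gmdate() always formats in UTC, regardless of the server timezone.
    return gmdate('Y-m-d\TH:i:s\Z', $timestamp);
}

/**
 * Hypothetical stand-in for import_time_from_engine(): ISO 8601 UTC string to timestamp.
 */
function example_import_time_from_engine($time) {
    // The trailing "Z" tells strtotime() the value is in UTC.
    return strtotime($time);
}

$formatted = example_format_time_for_engine(0); // "1970-01-01T00:00:00Z"
$roundtrip = example_import_time_from_engine(example_format_time_for_engine(1463061840));
```

The round trip returns the original timestamp, which is the property the two methods need to preserve between them.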

Format string fields

public static function format_string_for_engine($string) {
    // Limit the string length, convert it with iconv if your search engine only supports a specific charset...
}

By default, \core_search\document::format_string_for_engine returns the string as it is.
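For example, an engine with a maximum field length might truncate strings before sending them. The sketch below is a standalone illustration: the function name and the $maxchars parameter are hypothetical, and it assumes UTF-8 input and that the mbstring extension is available.

```php
<?php
/**
 * Hypothetical stand-in for format_string_for_engine(): truncate to a maximum length.
 *
 * @param string $string UTF-8 text to send to the engine.
 * @param int $maxchars Maximum number of characters the engine accepts (assumed limit).
 * @return string The possibly shortened string.
 */
function example_format_string_for_engine($string, $maxchars = 255) {
    // mb_substr() counts characters, so multibyte characters are never cut in half.
    return mb_substr($string, 0, $maxchars, 'UTF-8');
}

$short = example_format_string_for_engine('héllo wörld', 5); // "héllo"
```

Counting characters rather than bytes matters here, because a byte-based cut could split a multibyte character and produce invalid UTF-8.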