MoodleDocs - User contributions [en]

Machine learning backends

2019-12-16T08:58:48Z

Dmonllao: /* Predictor */

== Introduction ==

Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learning training, prediction and model evaluation. It is a good idea to also read the [https://docs.moodle.org/dev/Analytics_API Analytics API] page for concept definitions, how these concepts are implemented in Moodle and how machine learning backend plugins fit into the Analytics API.

Machine learning backends and Moodle communicate through files because the code that processes the dataset can be written in PHP, in Python, in other languages, or can even use cloud services. This needs to be scalable, so backends are expected to handle big files and, if necessary, to train algorithms by reading input files in batches.
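
Reading an input file in batches can be sketched as follows. This is a minimal illustration in plain PHP, assuming the dataset is a CSV file; ''read_dataset_in_batches'' is a hypothetical helper and not part of the Moodle API.

<code php>
/**
 * Yields the rows of a CSV dataset file in fixed-size batches.
 *
 * Hypothetical helper for illustration only, not part of the Moodle API.
 */
function read_dataset_in_batches(string $path, int $batchsize): \Generator {
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open dataset file: $path");
    }
    try {
        $batch = [];
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // Skip blank lines.
            }
            $batch[] = $row;
            if (count($batch) === $batchsize) {
                yield $batch;
                $batch = [];
            }
        }
        if ($batch) {
            yield $batch; // Last, possibly smaller, batch.
        }
    } finally {
        fclose($handle);
    }
}

// Usage: write a small five-row file and read it back two rows at a time.
$path = tempnam(sys_get_temp_dir(), 'mlb');
file_put_contents($path, "1,0,1\n0,1,0\n1,1,1\n0,0,0\n1,0,0\n");
$sizes = [];
foreach (read_dataset_in_batches($path, 2) as $batch) {
    $sizes[] = count($batch);
}
unlink($path);
</code>

Only one batch is held in memory at a time, so the memory footprint stays bounded regardless of the file size.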

Machine learning backends are a new Moodle plugin type. They are stored in lib/mlbackend, where you can add your own plugins.

== Backends included in Moodle core ==

The '''PHP backend''' is the default predictions processor, as it is written in PHP and does not have any external dependencies. It uses logistic regression.
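
As a rough illustration of what logistic regression does at prediction time, the sketch below turns a sample's indicator values into a probability and a binary label. It is plain PHP with made-up weights, not the PHP backend's actual code; ''logistic_predict'' is a hypothetical name.

<code php>
/**
 * Returns the probability of the positive class for one sample.
 *
 * Hypothetical helper: the weights and bias would normally come from training.
 */
function logistic_predict(array $weights, float $bias, array $features): float {
    $z = $bias;
    foreach ($features as $i => $value) {
        $z += $weights[$i] * $value;
    }
    // The logistic (sigmoid) function maps any real number into (0, 1).
    return 1.0 / (1.0 + exp(-$z));
}

$weights = [2.0, -1.0]; // Made-up values for illustration.
$probability = logistic_predict($weights, 0.0, [1.0, 0.0]);
$label = $probability >= 0.5 ? 1 : 0;
</code>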

The '''Python backend''' requires the ''python'' binary (either Python 2 or Python 3) and the [https://pypi.python.org/pypi?name=moodlemlbackend&version=0.0.5&:action=display moodlemlbackend Python package], which is maintained by Moodle HQ. It is based on [https://www.tensorflow.org/ Google's TensorFlow library] and uses a feed-forward neural network with a single hidden layer. The ''moodlemlbackend'' package stores model performance information that can be visualised using [https://www.tensorflow.org/get_started/summaries_and_tensorboard TensorBoard]. Information generated during model evaluation is available through the models management page, under each model's ''Actions > Log'' menu. The ''moodlemlbackend'' source code is available at https://github.com/moodlehq/moodle-mlbackend-python.

'''The Python backend is recommended over the PHP backend''' as it predicts more accurately and is faster.

== Interfaces ==

In summary, the purpose of these interfaces is to:
* Evaluate a provided prediction model
* Train machine learning algorithms with the existing site data
* Predict targets based on previously trained algorithms

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. The two main types are ''classifiers'' and ''regressors''. We provide the ''Regressor'' interface, but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

You can use '''is_ready''' to check that the backend is available.

<code php>
/**
 * Is it ready to predict?
 *
 * @return bool
 */
public function is_ready();
</code>

The purpose of '''clear_model''' and '''delete_output_dir''' is to clean up the data created by the machine learning backend.

<code php>
/**
 * Delete all stored information of the current model id.
 *
 * This method is called when there are important changes to a model,
 * all previous training algorithms using that version of the model
 * should be deleted.
 *
 * @param string $uniqueid The site model unique id string
 * @param string $modelversionoutputdir The output dir of this model version
 * @return null
 */
public function clear_model($uniqueid, $modelversionoutputdir);

/**
 * Delete the output directory.
 *
 * This method is called when a model is completely deleted.
 *
 * @param string $modeloutputdir The model directory id (parent of all model versions subdirectories).
 * @param string $uniqueid The site model unique id string
 * @return null
 */
public function delete_output_dir($modeloutputdir, $uniqueid);
</code>

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

<code php>
/**
 * Train this processor classification model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function train_classification($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Classifies the provided dataset samples.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function classify($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Evaluates this processor classification model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param float $maxdeviation
 * @param int $niterations
 * @param \stored_file $dataset
 * @param string $outputdir
 * @param string $trainedmodeldir
 * @return \stdClass
 */
public function evaluate_classification($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir, $trainedmodeldir);
</code>
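
The roles of ''$niterations'' and ''$maxdeviation'' are not spelled out above. One plausible reading, sketched here in plain PHP, is that the evaluation is repeated ''$niterations'' times and the result is treated as unreliable when the standard deviation of the scores exceeds ''$maxdeviation''. The function name ''summarise_evaluation'' and the fields of the returned object are assumptions made for this example, not the documented return format.

<code php>
/**
 * Summarises the scores of several evaluation iterations.
 *
 * Hypothetical helper: the field names below are illustrative only.
 */
function summarise_evaluation(array $scores, float $maxdeviation): \stdClass {
    $n = count($scores);
    if ($n === 0) {
        throw new \InvalidArgumentException('At least one score is required');
    }

    $mean = array_sum($scores) / $n;

    // Population standard deviation of the per-iteration scores.
    $variance = 0.0;
    foreach ($scores as $score) {
        $variance += ($score - $mean) ** 2;
    }
    $deviation = sqrt($variance / $n);

    $result = new \stdClass();
    $result->score = $mean;
    $result->deviation = $deviation;
    $result->stable = $deviation <= $maxdeviation;
    return $result;
}

// Four iterations with consistent scores, and four with erratic ones.
$stable = summarise_evaluation([0.80, 0.82, 0.78, 0.80], 0.05);
$unstable = summarise_evaluation([0.95, 0.40, 0.70, 0.55], 0.05);
</code>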

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is continuous rather than categorical, such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome value is known (e.g. courses in the past where the students' final grades are known). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

<code php>
/**
 * Train this processor regression model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function train_regression($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Estimates linear values for the provided dataset samples.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param mixed $outputdir
 * @return void
 */
public function estimate($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Evaluates this processor regression model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param float $maxdeviation
 * @param int $niterations
 * @param \stored_file $dataset
 * @param string $outputdir
 * @param string $trainedmodeldir
 * @return \stdClass
 */
public function evaluate_regression($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir, $trainedmodeldir);
</code>

GSOC/2019

2019-09-05T01:45:05Z

Dmonllao: /* Adding multi-class classification to machine learning backend */

== Projects ==

=== Attendance password rotation/expiry ===

[https://moodle.org/plugins/mod_attendance The attendance plugin] allows teachers to display a QR code so that students can record their own attendance. The QR code is currently static for the whole session. This project aims to increase the security of the feature by frequently changing the displayed QR code and expiring the old one, making it difficult to share the QR code outside the session.

* Student: [https://moodle.org/user/profile.php?id=2628543 Mohammed Rahman]
* Mentor: Dan Marsden

* [https://docs.moodle.org/dev/images_dev/d/de/gsoc2019attendanceqr.ogg video showing the work in progress (2019-06-02)]

=== Adding multi-class classification to machine learning backend ===

This project adds multi-class classification to the Moodle machine learning backends by exposing to core the functionality provided by Python TensorFlow and the PHP-ML library.

* Student: [https://moodle.org/user/profile.php?id=2404150 Vlad Apetrei]
* Mentor: David Monllaó
* Project outcomes https://gist.github.com/valadhi/992b808005a2cf7b7988276aa1533c92

Recommender system specification

2019-05-07T16:19:11Z

Dmonllao: /* Recommender system */

This is a proposal for a [https://en.wikipedia.org/wiki/Recommender_system recommender system] in Moodle. A recommender system ''"seeks to predict the "rating" or "preference" a user would give to an item"''. In other words, it tries to identify items that would be interesting for a user.

== Proposal info ==
This proposal is based on some assumptions:
# Recommender systems will be limited to specific contexts in most cases (e.g. a course, an activity).
## In cases where they will not be limited to a specific context (e.g. recommend a course) the number of items will not reach millions of records. The reason is that a two-dimensional array with scalar values is loaded in PHP memory.
# Recommendations can be generated on-demand, that is: we don't need to train the recommender systems in CLI tasks in the background before being able to use them. We can do it because of #1 above.
# The training data changes too often to spend resources on a complex caching system.
## Every time we have a new user in a course or a new activity (using the example recommender system described below) the dimensions of the training data change and the recommender system needs to be re-trained.
## Every time there is a new user rating (using the example recommender system below) the training data should be refreshed.
# We want the filtering to be applied before generating the training data as it modifies the dimensions of the dataset which is critical for the recommender system.

== Classes diagram ==
This is an overview of the classes involved.

[[File:Recommender_system_class_diagram.png]]

== API specs ==

=== Public API ===
The code snippet below is an example of the proposed public API. It generates two recommended activities of type '''page''' for the user with id '''111''' in the course with id '''222''', based on the values in a hypothetical '''user_activity_rates''' table. The recommender system should only consider users whose '''city''' is '''Barcelona'''.
<code php>

// The 'contexts' filter is used to restrict the recommender system to a specific set of contexts. It can be used to restrict a recommender
// system to the activities of a single course or to restrict a recommender system to the entries of a single glossary activity.
// The filters in 'dimensions' are applied to each of the dimensions used by the recommender system.
$coursecontext = \context_course::instance(222);
$filters = [
    'contexts' => [$coursecontext->id],
    'dimensions' => [
        'user' => ['city' => 'Barcelona'],
        'activity' => ['modulename' => 'page']
    ]
];

$dataset = new \core_course\analytics\recommender\dataset\activities($filters);

$recommender = new \core_analytics\recommender($dataset);
$recommendations = $recommender->recommend(111, 2);
</code>

=== Training dataset ===
The training data for a recommender system is usually a grid of values in a two-dimensional matrix, where one of the axes usually represents the user. The classes extending the base '''recommender_dataset''' class are responsible for instantiating their dimensions and filling the two-dimensional matrix.

''We could replace '''x''' and '''y''' with '''items''' and '''users''' if we cannot find use cases that do not directly involve users.''

Example of a '''recommender_dataset''' class.
<code php>

namespace core_course\analytics\recommender\dataset;

class activities implements \core_analytics\recommender_dataset {

    public function __construct(array $filters) {
        $this->filters = $filters;

        $this->x = new \core_course\analytics\recommender\dimension\activity();
        $this->y = new \core_user\analytics\recommender\dimension\user();
    }

    public function get_training_data() {
        global $DB;

        $courseids = $this->get_course_ids_from_context_filter();
        // Build the IN (...) clause with bound parameters instead of interpolating the ids into the SQL.
        list($insql, $params) = $DB->get_in_or_equal($courseids);
        $activityrates = $DB->get_records_sql("SELECT * FROM {user_activity_rates} WHERE courseid $insql", $params);

        $xitems = $this->x->get_items($this->filters);
        $yitems = $this->y->get_items($this->filters);

        $trainingdata = [];
        // Iterate through both $xitems and $yitems filling $trainingdata two-dimensional array with $activityrates values.

        return $trainingdata;
    }
}
</code>
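
The "iterate and fill" step left as a comment in the class above could look roughly like this. It is a standalone sketch in plain PHP; the function name ''build_training_matrix'' and the column names '''userid''', '''activityid''' and '''rate''' are assumptions, since the proposal does not define the '''user_activity_rates''' table.

<code php>
/**
 * Builds a [user id][activity id] => rating matrix.
 *
 * Hypothetical helper: the record field names are assumed for this example.
 */
function build_training_matrix(array $xids, array $yids, array $rates): array {
    // Start with every cell empty (null means "no rating yet").
    $matrix = [];
    foreach ($yids as $yid) {
        $matrix[$yid] = array_fill_keys($xids, null);
    }

    foreach ($rates as $rate) {
        // Ignore ratings for users or activities that were removed by the filters.
        if (isset($matrix[$rate->userid]) && array_key_exists($rate->activityid, $matrix[$rate->userid])) {
            $matrix[$rate->userid][$rate->activityid] = (float) $rate->rate;
        }
    }

    return $matrix;
}

$rates = [
    (object) ['userid' => 111, 'activityid' => 10, 'rate' => 5],
    (object) ['userid' => 112, 'activityid' => 11, 'rate' => 3],
    (object) ['userid' => 999, 'activityid' => 10, 'rate' => 1], // Filtered-out user.
];
$matrix = build_training_matrix([10, 11], [111, 112], $rates);
</code>

Because the filters decide which ids are passed in, the matrix dimensions change whenever the filters or the set of users and activities change, which is why the proposal applies the filtering before generating the training data.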

=== Dimensions ===
Classes like '''user''' or '''activity''' (shown below) that extend the base class '''recommender_dimension''' represent each of the dimensions in the two-dimensional matrix. They return the list of records used by the implementation of the '''recommender_dataset''' class. They are kept separate from the '''recommender_dataset''' class so that they can be reused in different recommender systems.

''We can remove the separation between '''recommender_dataset''' and '''recommender_dimension''' if we don't find enough use cases that justify it.''

These implementations serve as examples of '''recommender_dimension''' classes.
<code php>
namespace core_course\analytics\recommender\dimension;

class activity extends \core_analytics\recommender_dimension {

    private $acceptedfilters = ['modulename', 'coursecategory'];

    public function get_items(array $filters) {
        global $DB;

        // The context filtering would not make sense applied to context module if what we want is a list of activities.
        // Assumes $filters has the structure shown in the public API example.
        list($ctxsql, $params) = $DB->get_in_or_equal($filters['contexts'], SQL_PARAMS_NAMED);
        $params['contextlevel'] = CONTEXT_COURSE;
        $params['modulename'] = $filters['dimensions']['activity']['modulename'];

        return $DB->get_recordset_sql("SELECT cm.*, c.* FROM {course_modules} cm
              JOIN {course} c ON cm.course = c.id
              JOIN {modules} m ON m.id = cm.module
              JOIN {context} ctx ON ctx.contextlevel = :contextlevel AND ctx.instanceid = c.id
             WHERE ctx.id $ctxsql AND m.name = :modulename", $params);
    }
}

namespace core_user\analytics\recommender\dimension;

class user extends \core_analytics\recommender_dimension {

    public function get_items(array $filters) {
        global $DB;

        // The 'contexts' filter is ignored as users depend on the system context.
        return $DB->get_recordset_sql("SELECT * FROM {user}");
    }
}
</code>

=== Recommender system ===

The recommender class is the key element of the whole system and can be shared across all recommender systems built using this API. Extra methods to evaluate the accuracy of the recommender system should be added.

''Recommender systems can also be used for things like predicting student grades. We need to rename or add some methods and parameters if we want this kind of usage to feel natural. For example, a '''predict($yid, $xid)''' method would be more appropriate for predicting student grades based on previous grades.''

This is the recommender class skeleton.

<code php>
namespace core_analytics;

class recommender {

    public function __construct(\core_analytics\recommender_dataset $dataset) {
        $this->dataset = $dataset;
    }

    public function recommend($yid, $nrecommendations = 1) {
        $trainingdata = $this->dataset->get_training_data();

        // Collaborative filtering or any other alternative. This is just an example.
        $model = $this->get_embeddings($trainingdata);

        $y = $trainingdata[$yid];
        return $model->recommend($y, $nrecommendations);
    }
}
</code>
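
The skeleton leaves the actual algorithm open ('''get_embeddings''' is not specified). As one possible alternative, the sketch below uses simple user-based collaborative filtering with cosine similarity over a [user id][item id] matrix in which ''null'' means "not rated". The function names are hypothetical and the code is plain PHP, independent of the proposed classes.

<code php>
/**
 * Cosine similarity between two users, using only items both have rated.
 */
function cosine_similarity(array $a, array $b): float {
    $dot = 0.0;
    $norma = 0.0;
    $normb = 0.0;
    foreach ($a as $xid => $value) {
        $other = $b[$xid] ?? null;
        if ($value === null || $other === null) {
            continue;
        }
        $dot += $value * $other;
        $norma += $value ** 2;
        $normb += $other ** 2;
    }
    if ($norma == 0.0 || $normb == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($norma) * sqrt($normb));
}

/**
 * Returns the ids of the top $nrecommendations items the user has not rated yet.
 */
function recommend_items(array $matrix, int $yid, int $nrecommendations = 1): array {
    $scores = [];
    foreach ($matrix as $otherid => $row) {
        if ($otherid === $yid) {
            continue;
        }
        $similarity = cosine_similarity($matrix[$yid], $row);
        foreach ($row as $xid => $value) {
            // Score only items rated by the other user and not yet rated by the target user.
            if ($value !== null && $matrix[$yid][$xid] === null) {
                $scores[$xid] = ($scores[$xid] ?? 0.0) + $similarity * $value;
            }
        }
    }
    arsort($scores); // Highest score first, keeping the item ids as keys.
    return array_slice(array_keys($scores), 0, $nrecommendations);
}

$matrix = [
    111 => [10 => 5.0, 11 => 4.0, 12 => null, 13 => null],
    112 => [10 => 5.0, 11 => 5.0, 12 => 5.0, 13 => 1.0],
    113 => [10 => 1.0, 11 => 2.0, 12 => 1.0, 13 => 3.0],
];
$recommendations = recommend_items($matrix, 111, 2);
</code>

The whole matrix is processed in memory on each call, which matches the proposal's assumption that recommendations are generated on demand for a limited number of items.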

Recommender system specification

2019-05-07T15:58:56Z

Dmonllao:


Recommender system specification

2019-05-07T15:55:13Z

Dmonllao:


File:Recommender system class diagram.png

2019-05-07T15:28:18Z

Dmonllao:

Recommender system specification

2019-05-07T15:27:54Z

Dmonllao:

This is a proposal for a [https://en.wikipedia.org/wiki/Recommender_system recommender system] in Moodle. A recommender system ''"seeks to predict the "rating" or "preference" a user would give to an item."'', in other words, it tries to identify items that would be interesting for a user.

This code snippet below is an example using the proposed public API. This generates two recommended activities of type ''page'' for the user with id 111 in the course with id 222 based on the values in an hypothetical ''user_activity_rates'' table. The recommender system should only consider users whose ''city'' is ''Barcelona''.
<code php>

// The 'contexts' filter is used to restrict the recommender system to a specific set of contexts. It can be used to restrict a recommender
// system to the activities of a single course or to restrict a recommender system to the entries of a single glossary activity.
// The filters in 'dimensions' are applied to each of the dimensions used by the recommender system.
$coursecontext = \context_course::instance(222);
$filters = [
'contexts' => [$coursecontext->id],
'dimensions' => [
'user' => ['city' => 'Barcelona'],
'activity' => ['modulename' => 'page']
]
];

$dataset = new \core_course\analytics\recommender\dataset\activities($filters);

$recommender = \core_analytics\recommender($dataset);
$recommendations = $recommender->recommend(111, 2);
</code>

This is an overview of the classes involved.

The training data for a recommender system is usually a grid of values in a two-dimensional matrix, where one of the axis usually represent the user. The classes extending the base '''recommender_dataset''' class are responsible of instantiating their dimensions and to fill the two-dimensional matrix.

''We could replace '''x''' and '''y''' for '''items''' and '''users''' if we can not find use cases that do not directly involve users.''

Example of a '''recommender_dataset''' class.
<code php>

namespace \core_course\analytics\recommender\dataset;
class activities implements \core_analytics\recommender_dataset {

public function __construct(array $filters) {
$this->filters = $filters;

$this->x = new \core_course\analytics\recommender\dimension\activity();
$this->y = new \core_user\analytics\recommender\dimension\user();
}

public function get_training_data() {
$courseids = $this->get_course_ids_from_context_filter();
$activityrates = $DB->get_records("SELECT * FROM {user_activity_rates} where courseid IN $courseids");

$xitems = $this->x->get_items($this->filters);
$yitems = $this->y->get_items($this->filters);
// Iterate through both $xitems and $yitems filling $trainingdata two-dimensional array with $activityrates values.

return $trainingdata;
}
}
</code>

Classes like '''user''' or '''activity''' that extend the base class '''recommender_dimension''' represent each of dimensions in the two-dimensional matrix. They basically return the list of records used by the implementation of the '''recommender_dataset''' class. They are separated from the '''recommender_dataset'' class for re-usability in different recommender systems.

''We can remove this '''recommender_dataset - recommender_dimension''' separation if we don't find enough use cases that justify the separation.''

These implementations serve as example of '''recommender_dimension''' classes.
<code php>
namespace core_course\analytics\recommender\dimension;

class activity extends \core_analytics\recommender_dimension {

    private $acceptedfilters = ['modulename', 'coursecategory'];

    public function get_items(array $filters) {
        global $DB;

        // The context filtering would not make sense applied to context module if what we want is a list of activities.
        // Assumption: the course context ids to filter by arrive in $filters['contextids'].
        list($contextsql, $params) = $DB->get_in_or_equal($filters['contextids'], SQL_PARAMS_NAMED);
        $params['contextlevel'] = CONTEXT_COURSE;
        $params['modulename'] = $filters['modulename'];

        // The module name is stored in {modules}, so it is joined through cm.module.
        return $DB->get_recordset_sql("SELECT cm.*, c.*
                  FROM {course_modules} cm
                  JOIN {modules} m ON m.id = cm.module
                  JOIN {course} c ON cm.course = c.id
                  JOIN {context} ctx ON ctx.contextlevel = :contextlevel AND ctx.instanceid = c.id
                 WHERE ctx.id $contextsql AND m.name = :modulename", $params);
    }
}

namespace core_user\analytics\recommender\dimension;

class user extends \core_analytics\recommender_dimension {

    public function get_items(array $filters) {
        global $DB;

        // Context filters are ignored as users depend on the system context.
        return $DB->get_recordset_sql("SELECT * FROM {user}");
    }
}
</code>

This is the recommender class skeleton.
<code php>
namespace core_analytics;

class recommender {

    /** @var \core_analytics\recommender_dataset */
    protected $dataset;

    public function __construct(\core_analytics\recommender_dataset $dataset) {
        $this->dataset = $dataset;
    }

    public function recommend($yid, $nrecommendations = 1) {
        $trainingdata = $this->dataset->get_training_data();

        // Collaborative filtering or any other alternative. This is just an example;
        // get_embeddings() is a placeholder that this proposal does not define.
        $model = $this->get_embeddings($trainingdata);

        $y = $trainingdata[$yid];
        return $model->recommend($y, $nrecommendations);
    }
}
</code>
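
The skeleton leaves the actual algorithm open. As one possible alternative to embeddings, the sketch below uses plain user-based collaborative filtering with cosine similarity. The function names are hypothetical, and it assumes the training data is a [y id][x id] => rate array in which 0 means "not rated".

```php
<?php
/** Cosine similarity between two rows that share the same keys. */
function cosine_similarity(array $a, array $b): float {
    $dot = 0.0;
    $norma = 0.0;
    $normb = 0.0;
    foreach ($a as $key => $value) {
        $dot += $value * $b[$key];
        $norma += $value ** 2;
        $normb += $b[$key] ** 2;
    }
    if ($norma == 0.0 || $normb == 0.0) {
        // A row without any rate is not similar to anything.
        return 0.0;
    }
    return $dot / (sqrt($norma) * sqrt($normb));
}

/**
 * Returns the ids of the $n x items, not yet rated by $yid, with the highest score.
 *
 * The score of an item is the sum of the rates given by the other rows,
 * weighted by how similar each of those rows is to the $yid row.
 */
function recommend_items(array $trainingdata, $yid, int $n = 1): array {
    $target = $trainingdata[$yid];
    $scores = [];
    foreach ($trainingdata as $otherid => $row) {
        if ($otherid == $yid) {
            continue;
        }
        $similarity = cosine_similarity($target, $row);
        foreach ($row as $xid => $rate) {
            // Only score the items that $yid has not rated yet.
            if ($target[$xid] == 0) {
                $scores[$xid] = ($scores[$xid] ?? 0.0) + $similarity * $rate;
            }
        }
    }
    arsort($scores);
    return array_slice(array_keys($scores), 0, $n);
}

$trainingdata = [
    1 => [10 => 5, 11 => 0, 12 => 0],
    2 => [10 => 4, 11 => 5, 12 => 1],
    3 => [10 => 0, 11 => 1, 12 => 5],
];
$recommended = recommend_items($trainingdata, 1);
```

In this example user 1 only shares a rated activity with user 2, so the activity that user 2 rated highest among those user 1 has not rated (id 11) is recommended first.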

Recommender system specification

2019-05-07T07:53:53Z

Dmonllao: Created page with "This is a proposal for a [https://en.wikipedia.org/wiki/Recommender_system recommender system] in Moodle. A recommender system ''"seeks to predict the "rating" or "preference"..."

File:Inspire data flow.png

2019-04-29T15:48:06Z

Dmonllao: Dmonllao uploaded a new version of File:Inspire data flow.png

Machine learning backends

2019-03-27T10:49:05Z

Dmonllao: /* Introduction */

== Introduction ==

Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learning training, prediction and model evaluation. It is also worth reading the [https://docs.moodle.org/dev/Analytics_API Analytics API] page, which defines the relevant concepts, explains how they are implemented in Moodle and shows how machine learning backend plugins fit into the Analytics API.

Machine learning backends and Moodle communicate through files, because the code that processes the dataset can be written in PHP, in Python or in other languages, or can even use cloud services. This needs to scale, so backends are expected to handle big files and, if necessary, to train algorithms by reading input files in batches.
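
Reading an input file in batches can be sketched with a PHP generator, so that only one batch of rows is held in memory at a time. This is an illustration, not the code used by the core backends: the function name is hypothetical and the file is treated as a plain CSV, ignoring any header or metadata rows that a real dataset file may contain.

```php
<?php
/**
 * Yields the rows of a CSV file in batches of at most $batchsize rows.
 */
function read_dataset_in_batches(string $filepath, int $batchsize): Generator {
    $handle = fopen($filepath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Can not open $filepath");
    }
    try {
        $batch = [];
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            $batch[] = $row;
            if (count($batch) === $batchsize) {
                yield $batch;
                $batch = [];
            }
        }
        // The last batch may be smaller than $batchsize.
        if ($batch) {
            yield $batch;
        }
    } finally {
        fclose($handle);
    }
}

// Example with a small temporary file: 5 rows read in batches of 2.
$filepath = tempnam(sys_get_temp_dir(), 'dataset');
file_put_contents($filepath, "1,0,1\n0,1,0\n1,1,1\n0,0,0\n1,0,0\n");
$batchsizes = [];
foreach (read_dataset_in_batches($filepath, 2) as $batch) {
    // A real backend would update its model with this batch here.
    $batchsizes[] = count($batch);
}
unlink($filepath);
```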

Machine learning backends are a new Moodle plugin type. They are stored in lib/mlbackend, where you can add your own plugins.

== Backends included in Moodle core ==

The '''PHP backend''' is the default predictions processor as it is written in PHP and does not have any external dependencies. It uses logistic regression.

The '''Python backend''' requires the ''python'' binary (either Python 2 or Python 3) and the [https://pypi.python.org/pypi?name=moodlemlbackend&version=0.0.5&:action=display moodlemlbackend Python package], which is maintained by Moodle HQ. It is based on [https://www.tensorflow.org/ Google's TensorFlow library] and uses a feed-forward neural network with a single hidden layer. The ''moodlemlbackend'' package stores model performance information that can be visualised using [https://www.tensorflow.org/get_started/summaries_and_tensorboard TensorBoard]. Information generated during model evaluation is available through the models management page, under each model's ''Actions > Log'' menu. The ''moodlemlbackend'' source code is available at https://github.com/moodlehq/moodle-mlbackend-python.

'''The Python backend is recommended over the PHP backend''' as it predicts more accurately and is faster.

== Interfaces ==

In summary, these interfaces are used to:
* Evaluate a provided prediction model
* Train machine learning algorithms with the existing site data
* Predict targets based on previously trained algorithms

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. The two main types are ''classifiers'' and ''regressors''. We provide the ''Regressor'' interface but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

You can use '''is_ready''' to check that the backend is available.

<code php>
/**
 * Is it ready to predict?
 *
 * @return bool
 */
public function is_ready();
</code>

The purpose of '''clear_model''' and '''delete_output_dir''' is to clean up the data created by the machine learning backend.

<code php>
/**
 * Delete all stored information of the current model id.
 *
 * This method is called when there are important changes to a model,
 * all previous training algorithms using that version of the model
 * should be deleted.
 *
 * @param string $uniqueid The site model unique id string
 * @param string $modelversionoutputdir The output dir of this model version
 * @return null
 */
public function clear_model($uniqueid, $modelversionoutputdir);

/**
 * Delete the output directory.
 *
 * This method is called when a model is completely deleted.
 *
 * @param string $modeloutputdir The model directory id (parent of all model versions subdirectories).
 * @return null
 */
public function delete_output_dir($modeloutputdir);
</code>

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

<code php>
/**
 * Train this processor classification model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function train_classification($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Classifies the provided dataset samples.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function classify($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Evaluates this processor classification model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param float $maxdeviation
 * @param int $niterations
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function evaluate_classification($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);
</code>
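
One way to read the ''$niterations'' and ''$maxdeviation'' parameters is that the backend repeats a train and test run ''$niterations'' times, then reports the average score and flags the evaluation as unreliable when the standard deviation of the scores exceeds ''$maxdeviation''. The sketch below works under that assumption. The function name, the callable and the properties of the returned object are hypothetical, they are not the structure that core expects.

```php
<?php
/**
 * Summarises the scores of repeated evaluation runs.
 *
 * @param callable $runiteration Hypothetical callback: trains and tests once, returns a score.
 * @param float $maxdeviation Maximum accepted standard deviation of the scores.
 * @param int $niterations Number of times the evaluation is repeated.
 * @return stdClass
 */
function summarise_evaluation(callable $runiteration, float $maxdeviation, int $niterations): stdClass {
    if ($niterations < 1) {
        throw new InvalidArgumentException('At least one iteration is required');
    }

    $scores = [];
    for ($i = 0; $i < $niterations; $i++) {
        $scores[] = (float)$runiteration($i);
    }

    // Average score and (population) standard deviation across iterations.
    $mean = array_sum($scores) / count($scores);
    $squareddiffs = array_map(function ($score) use ($mean) {
        return ($score - $mean) ** 2;
    }, $scores);
    $deviation = sqrt(array_sum($squareddiffs) / count($scores));

    $result = new stdClass();
    $result->score = $mean;
    $result->deviation = $deviation;
    $result->stable = ($deviation <= $maxdeviation);
    return $result;
}

// Fixed scores stand in for real train and test runs.
$fixedscores = [0.80, 0.82, 0.78, 0.80];
$result = summarise_evaluation(function ($i) use ($fixedscores) {
    return $fixedscores[$i];
}, 0.02, count($fixedscores));
```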

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is continuous (linear) rather than a category, such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": It requires a training data set of elements whose outcome is known (e.g. courses in the past where the students' final grades are known). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

<code php>
/**
 * Train this processor regression model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function train_regression($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Estimates linear values for the provided dataset samples.
 *
 * @param string $uniqueid
 * @param \stored_file $dataset
 * @param mixed $outputdir
 * @return void
 */
public function estimate($uniqueid, \stored_file $dataset, $outputdir);

/**
 * Evaluates this processor regression model using the provided supervised learning dataset.
 *
 * @param string $uniqueid
 * @param float $maxdeviation
 * @param int $niterations
 * @param \stored_file $dataset
 * @param string $outputdir
 * @return \stdClass
 */
public function evaluate_regression($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);
</code>

Projects for new developers

2019-02-08T12:04:19Z

Dmonllao:

{{GSOC}}
==Getting started==

* Moodle uses PHP, JavaScript, SQL and a number of other Web languages, so learning those is a good place to start.
* When you have some basic PHP programming skills, you may wish to start learning about how the Moodle code is organised. It is recommended that you go through the [[Tutorial]].
* If you are looking for projects suggested in the tracker, look for issues with the [https://tracker.moodle.org/issues/?jql=labels%20in%20%28addon_candidate%29 'addon_candidate' label].
* If you are looking to make a quick contribution, look for tracker issues marked as [https://tracker.moodle.org/issues/?jql=Difficulty%20%3D%20Easy easy].
* As you become more involved in Moodle development, you might like to learn more about the [[Coding|coding conventions]] used and how changes to Moodle core code are [[Process|processed]]. Once you become confident enough, please consider adopting a [https://moodle.org/plugins/browse.php?list=set&id=61 plugin seeking a new maintainer].

==Potential projects==

This evolving page lists possible Moodle projects for new developers, derived from community suggestions, together with experienced core developers willing to mentor them.

''If you have any ideas for new features in Moodle which might be suitable as projects for new developers, please see [[New feature ideas]].''

=== Improve SCORM plugin ===
There are a number of areas of SCORM that could be improved as part of a GSOC project. Some of these are bigger projects, while others could be combined to form a single project.

These are just some examples; take a look at the open SCORM issues in the Moodle tracker for a list of other issues.
* Improve Grading (MDL-51086, MDL-52871, MDL-37421)
* Improve validation of SCORM packages (MDL-38060, MDL-24057)
* Convert YUI Treeview to use jQuery (Moodle is moving away from YUI and the existing Treeview has a few issues)
* Choose where to send users after completing SCORM (MDL-61677)

Requirement for prospective students:
* We require prospective students to make an attempt at fixing at least 1 issue in the Moodle tracker before their proposal can be considered. This MUST be completed before your application can be considered valid.
:'''Skills required''': PHP
:'''Difficulty level''': Medium
:'''Possible mentor''': [http://moodle.org/user/view.php?id=21591&course=5 Dan Marsden]

=== Acceptance tests for the Moodle app ===

From Moodle 3.7 it will be possible to write and run acceptance tests for the Moodle app.

Tasks:
* Write new acceptance tests for the Moodle app

Requirement for prospective students:

* We require prospective students to set up and run the existing tests in a local environment, following this documentation: [[Acceptance testing for the mobile app]]. Students must record a video of the tests running on a local machine.
* We also require students to create an additional simple test (detailed instructions for writing tests are available in the previous link)

:'''Skills required:''' Behat (PHP)
:'''Difficulty level:''' Medium
:'''Possible mentor:''' [https://moodle.org/user/profile.php?id=49568 Juan Leyva]

=== Front-end editor for the plugin skeleton generator ===

This is a follow-up project for a [[GSOC/2016#Plugin skeleton generator|successful GSOC 2016 project]] that resulted in a new tool allowing developers to quickly generate a skeleton (scaffolding, template) for a new Moodle plugin. The tool proved helpful, with a significant impact on the quality of Moodle plugin code. This follow-up project aims at further improvements of the skeleton generator. The primary goal is to implement a developer-friendly user interface / front-end editor that makes it easy to configure the plugin's properties (recipe file). The UI should guide the developer through the process of designing and defining the plugin properties and facilitate the whole process.

* We require prospective students to make an attempt at fixing at least 1 issue in the Moodle tracker before their proposal can be considered. This MUST be completed before your application can be considered valid.

:'''Skills required''': PHP + JS
:'''Difficulty level''': Medium
:'''Possible mentor''': [http://moodle.org/user/view.php?id=1601&course=5 David Mudrák]

=== Add multi-class capabilities to Moodle's machine learning backends ===

Moodle includes an analytics API that uses machine learning for binary classification. This is enough for classification problems like "student at risk" vs "student not at risk". We want to expand this API's capabilities by supporting multi-class classification, so we could write models with classes like "very low grade", "low grade", "pass", "best student ever".

Tasks:
* Modify the two machine learning backends included in Moodle core to support multi-class classification problems. This includes the PHP ML backend (based on the php-ml library) and the Python ML backend (TensorFlow).

Requirement for prospective students:
* We require prospective students to make an attempt at fixing at least 1 issue in the Moodle tracker before their proposal can be considered. This MUST be completed before your application can be considered valid.

:'''Tracker issue:''' https://tracker.moodle.org/browse/MDL-58992
:'''Skills required:''' PHP + Python + basic understanding of machine learning algorithms and TensorFlow
:'''Difficulty level:''' Medium/High
:'''Possible mentor:''' [https://moodle.org/user/profile.php?id=122326 David Monllaó]

=== Add regressors to core machine learning backends ===

Moodle includes an analytics API that uses machine learning for binary classification. This is enough for classification problems like "student at risk" vs "student not at risk". We want to expand this API's capabilities to support regression, so we can write models that estimate linear values instead of classes.

Tasks:
* Modify the two machine learning backends included in Moodle core to support regression. This includes the PHP ML backend (based on the php-ml library) and the Python ML backend (TensorFlow).

Requirement for prospective students:
* We require prospective students to make an attempt at fixing at least 1 issue in the Moodle tracker before their proposal can be considered. This MUST be completed before your application can be considered valid.

:'''Tracker issue:''' https://tracker.moodle.org/browse/MDL-60523
:'''Skills required:''' PHP + Python + basic understanding of machine learning algorithms and TensorFlow
:'''Difficulty level:''' Medium/High
:'''Possible mentor:''' [https://moodle.org/user/profile.php?id=122326 David Monllaó]

==See also==

* [[GSOC]] - describing Moodle's involvement with Google in their Summer of Code program
* [https://tracker.moodle.org/issues/?jql=type%20in%20%28%22New%20Feature%22%2C%20Improvement%29%20AND%20resolution%20%3D%20unresolved%20and%20labels%20in%20%28addon_candidate%29%20ORDER%20BY%20votes%20DESC Popular new feature and improvement requests in Tracker that can be implemented as plugins]

Projects for new developers

2019-02-08T12:03:29Z

Dmonllao: /* Add multi-class capabilities to Moodle's machine learning backends */


Projects for new developers

2019-02-08T11:58:50Z

Dmonllao: /* Add multi-class capabilities to Moodle's machine learning backends */


Projects for new developers

2019-02-08T11:57:51Z

Dmonllao:


Analytics API

2018-12-04T15:42:19Z

Dmonllao: /* Create the model */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4+ provides two built-in models:

* [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser. However, this indicator can easily be coded so that it can be reused with other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated in a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
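The two scaling approaches described above can be sketched as plain PHP functions. This is only an illustration under stated assumptions: the function names, the default of 10 expected posts and the cut-off of 2 standard deviations are hypothetical choices and are not part of the Analytics API.

```php
<?php
// Hypothetical helpers, not part of the Moodle Analytics API.

/**
 * Scales a raw event count to the -1..1 range against an expected maximum.
 *
 * @param int $count Raw number of events (e.g. forum posts).
 * @param int $maxexpected Count that represents the maximum expected activity (assumed).
 * @return float
 */
function scale_count_to_indicator(int $count, int $maxexpected = 10): float {
    // Anything above the expected maximum is capped, anything below zero is floored.
    $ratio = max(0, min($count, $maxexpected)) / $maxexpected;
    return -1.0 + 2.0 * $ratio;
}

/**
 * Scales a value to the -1..1 range by its distance from the mean of the context.
 *
 * @param float $value The user's value.
 * @param float $mean Mean of all users in the same context.
 * @param float $stddev Standard deviation of all users in the same context.
 * @param float $maxdeviations Deviations from the mean that map to -1 or 1 (assumed).
 * @return float
 */
function scale_by_deviation(float $value, float $mean, float $stddev, float $maxdeviations = 2.0): float {
    if ($stddev <= 0) {
        // Every user has the same value, so nobody is above or below the mean.
        return 0.0;
    }
    $deviations = ($value - $mean) / $stddev;
    return max(-1.0, min(1.0, $deviations / $maxdeviations));
}
```

With these defaults, 5 posts out of an expected 10 gives 0, and a student 3 standard deviations below the mean is clamped to -1.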

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, with core ('''lib/classes/analytics''') being the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system; the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones are going to be processed next (for example, skipping the ones processed recently). You can take core_analytics\course as an example.

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int.
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
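As an illustration of the lazy loading recommended above, the following sketch only fetches the underlying record the first time it is needed. It is self-contained so it can run outside Moodle: ''analysable_sketch'' is a local stand-in for the methods listed above (''get_context()'' is left out because it needs Moodle's \context class), and ''lazy_course'', its loader callback and the record keys are hypothetical names, not core code.

```php
<?php
// Local stand-in for the methods listed above; the real interface is \core_analytics\analysable.
interface analysable_sketch {
    public function get_id();
    public function get_name();
    public function get_start();
    public function get_end();
}

// Hypothetical analysable that defers loading its record until it is needed.
final class lazy_course implements analysable_sketch {
    private int $id;
    /** @var callable Receives the id and returns the record as an array (stands in for a db query). */
    private $loader;
    private ?array $record = null;
    /** @var int Number of times the loader ran, exposed only to demonstrate the behaviour. */
    public int $loads = 0;

    public function __construct(int $id, callable $loader) {
        // Only the id is stored, so thousands of instances stay cheap in memory.
        $this->id = $id;
        $this->loader = $loader;
    }

    private function load(): array {
        if ($this->record === null) {
            $this->record = ($this->loader)($this->id);
            $this->loads++;
        }
        return $this->record;
    }

    public function get_id() {
        return $this->id;
    }

    public function get_name() {
        return $this->load()['fullname'];
    }

    public function get_start() {
        // An empty start date means the analysable has no start.
        return $this->load()['startdate'] ?: false;
    }

    public function get_end() {
        return $this->load()['enddate'] ?: false;
    }
}
```

An analyser can then build the full list of analysables and call ''get_id()'' on each of them without triggering any loading; the record is read once, on the first call that needs it.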

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match their expectations. ''e.g. if your model is only interested in quizzes with a time close, the analyser will return all quizzes and your model will exclude the ones without a time close. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return data associated with the sample ids they provide. This is important for 2 reasons:
* The data they provide alongside the sample origin is used to filter out indicators that are not related to what this analyser analyses. ''e.g. courses analysers do provide courses and information about courses, but not information about users; an '''is user profile complete''' indicator will require the user object to be available. A model using a courses analyser will not be able to use the '''is user profile complete''' indicator.''
* The data included here is cached in PHP static vars; on one hand this reduces the amount of db queries indicators need to perform. On the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
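The return value documented above (a list of sample ids plus the data related to each sample) could be assembled roughly as follows for a ''user enrolment'' sample. This is a sketch under assumptions: ''build_samples'' is a hypothetical helper that works on records already in memory, and keying each sample's data by database table name mirrors how ''required_sample_data'' is described in this page, not a verified copy of core code.

```php
<?php
/**
 * Hypothetical helper: builds the two-element array returned by get_all_samples / get_samples.
 *
 * @param object[] $enrolments user_enrolments-like records with id and userid properties.
 * @param object[] $users User records indexed by user id.
 * @param object $course The course all the enrolments belong to.
 * @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
 */
function build_samples(array $enrolments, array $users, object $course): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $enrolment) {
        $id = (int)$enrolment->id;
        $sampleids[$id] = $id;
        // Everything related to the sample is attached here so indicators
        // do not need to query the database again.
        $samplesdata[$id] = [
            'user_enrolments' => $enrolment,
            'course' => $course,
            'user' => $users[$enrolment->userid],
        ];
    }
    return [$sampleids, $samplesdata];
}
```

Attaching more related data saves queries later, but every extra object is held in memory for each sample, which is the balance mentioned above.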

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context to a sampleid. This is important because this sample's predictions will only be available for users with ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

'''processes_user_data''' and '''join_sample_user''' methods are used by the analytics implementation of the privacy API. You only need to override them if your analyser deals with user data. They are used to export and delete user data that is stored in analytics database tables:

/**
* Whether the plugin needs user data clearing or not.
*
* @return bool
*/
public function processes_user_data();
/**
* SQL JOIN from a sample to users table.
*
* More info in [https://github.com/moodle/moodle/blob/master/analytics/classes/local/analyser/base.php core_analytics\local\analyser\base]::join_sample_user
*
* @param string $sampletablealias The alias of the table with a sampleid field that will join with this SQL string
* @return string
*/
public function join_sample_user($sampletablealias);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators. In case you want your activity module to implement a [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out#Indicators community of inquiry] indicator you can extend ''core_analytics\local\indicator\community_of_inquiry_indicator''; look for examples in Moodle core.

You can use '''required_sample_data''' to specify what your indicator needs in order to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. For example, if you need a grade_grades record, mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them anyway, because the principle analysers follow is to include as much related data as possible. Still, do not flag related objects as required: an analyser may, for example, choose not to include the ''user'' object because it is too big and sites can have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
return null;
}
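The rule described above (a model can only use an indicator when its analyser provides everything the indicator requires) boils down to a simple set comparison. The sketch below illustrates that rule only; ''can_use_indicator'' is a hypothetical function and is not how the Analytics API exposes this check.

```php
<?php
/**
 * Hypothetical check: can an indicator be used with the data an analyser provides?
 *
 * @param string[]|null $required Return value of required_sample_data(), null if nothing is required.
 * @param string[] $provided Names of the elements the analyser attaches to each sample.
 * @return bool
 */
function can_use_indicator(?array $required, array $provided): bool {
    if ($required === null) {
        // The default implementation requires nothing, so any analyser is fine.
        return true;
    }
    // Usable only if no required element is missing from what the analyser provides.
    return array_diff($required, $provided) === [];
}
```

For example, an indicator requiring ''user'' is rejected for an analyser that only provides ''course'' and ''context'', which is why listing more requirements than strictly needed reduces where an indicator can be reused.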

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during the $starttime - $endtime period), but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?). ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical as it runs once for each sample and for each range in the time-splitting method; some tips:
* To avoid performance issues or repeated db queries, analyser classes provide information about the samples that you can use for your calculations. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also override the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is not unlimited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keep the memory usage acceptably low and prevents hard-to-trace caching bugs.
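The following sketch shows the shape of a yes/no calculation that relies on data already attached to the sample instead of querying the database. It runs outside Moodle, so everything in it is a stand-in: ''sample_data_store'' imitates the documented behaviour of ''retrieve()'' (returning false when the data is not available), the two constants mirror ''self::MIN_VALUE'' and ''self::MAX_VALUE'', and ''user_profile_complete'' is a hypothetical indicator.

```php
<?php
// Stand-ins for self::MIN_VALUE and self::MAX_VALUE.
const MIN_VALUE = -1;
const MAX_VALUE = 1;

// Hypothetical stand-in for the sample data an analyser provides to indicators.
final class sample_data_store {
    private array $data;

    public function __construct(array $data) {
        $this->data = $data;
    }

    /** Returns the requested element of a sample, or false if it is not available. */
    public function retrieve(string $tablename, int $sampleid) {
        return $this->data[$sampleid][$tablename] ?? false;
    }
}

/**
 * Hypothetical binary indicator: does the user have a picture and a profile description?
 *
 * @return int|null MAX_VALUE, MIN_VALUE, or null if it can not be calculated for this sample.
 */
function user_profile_complete(sample_data_store $store, int $sampleid): ?int {
    $user = $store->retrieve('user', $sampleid);
    if ($user === false) {
        // No user data for this sample, so the indicator can not be calculated.
        return null;
    }
    $complete = !empty($user->picture) && trim((string)$user->description) !== '';
    return $complete ? MAX_VALUE : MIN_VALUE;
}
```

This indicator ignores the time range because a profile is either complete or not; a time-dependent indicator would additionally limit the records it counts to the start and end times it receives.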

==== Target (core_analytics\local\target\base) ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Even though Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', Moodle 3.4 machine learning backends only support binary classifications. So unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically targets could be reused between models, although this is not recommended; you should focus instead on having a single model with a single set of indicators that work together towards predicting accurately. The only valid use case I can think of for models in production is using different time-splitting methods for the same target although, again, the proper way to solve this is by using a single time-splitting method specific to your needs.

The first thing a target must define is the analyser class that it will use. The analyser class is specified in '''get_analyser_class'''.

/**
* Returns the analyser class that should be used along with this target.
*
* @return string The full class name as a string
*/
abstract public function get_analyser_class();

'''is_valid_analysable''' and '''is_valid_sample''' are used to discard elements that are not valid for your target.

/**
* Allows the target to verify that the analysable is a good candidate.
*
* This method can be used as a quick way to discard invalid analysables.
* e.g. Imagine that your analysable don't have students and you need them.
*
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return true|string
*/
public function is_valid_analysable(\core_analytics\analysable $analysable, $fortraining = true);

/**
* Is this sample from the $analysable valid?
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return bool
*/
public function is_valid_sample($sampleid, \core_analytics\analysable $analysable, $fortraining = true);

'''calculate_sample''' is the method that calculates the target value.

/**
* Calculates this target for the provided samples.
*
* In case there are no values to return or the provided sample is not applicable just return null.
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param int|false $starttime Limit calculations to start time
* @param int|false $endtime Limit calculations to end time
* @return float|null
*/
protected function calculate_sample($sampleid, \core_analytics\analysable $analysable, $starttime = false, $endtime = false);

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

Time-splitting methods are useful to define when the analytics API will train the predictions processor and when it will generate predictions. As explained above in [[Analytics_API#Time_splitting_methods]], they define time ranges based on analysable elements start and end timestamps.

The base class is '''\core_analytics\local\time_splitting\base'''; if what you are after is to split the analysable duration in equal parts or in cumulative parts you can extend '''\core_analytics\local\time_splitting\equal_parts''' or '''\core_analytics\local\time_splitting\accumulative_parts''' instead.

'''define_ranges''' is the main method to implement and its values mostly depend on the current analysable element (available in '''$this->analysable'''). An array of time ranges should be returned. Each of these ranges should contain 3 attributes: a start time ('start') and an end time ('end') that will be passed to indicators so they can limit the amount of activity logs they read, and a third attribute, 'time', whose value will determine when the range will be executed.

/**
* Define the time splitting methods ranges.
*
* 'time' value defines when predictions are executed, their values will be compared with
* the current time in ready_to_predict
*
* @return array('start' => time(), 'end' => time(), 'time' => time())
*/
protected function define_ranges();
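
To make the shape of the returned array concrete, here is a minimal standalone sketch; it is not the core implementation. The function name ''sketch_define_equal_ranges'' and its parameters are hypothetical: a real time-splitting class would read the start and end from '''$this->analysable''' rather than take them as arguments. Using the end of each range as its 'time' is also an assumption made for illustration.

```php
<?php
/**
 * Hypothetical helper: split the period [$start, $end] into equal ranges.
 *
 * @param int $start Analysable start timestamp
 * @param int $end Analysable end timestamp
 * @param int $parts Number of ranges to produce
 * @return array[] Each item has 'start', 'end' and 'time' keys
 */
function sketch_define_equal_ranges(int $start, int $end, int $parts = 4): array {
    if ($parts < 1 || $end <= $start) {
        // Nothing sensible to split.
        return [];
    }
    $length = intdiv($end - $start, $parts);
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $start + ($length * $i);
        // The last range absorbs any remainder left by the integer division.
        $rangeend = ($i === $parts - 1) ? $end : $rangestart + $length;
        $ranges[] = [
            'start' => $rangestart,
            'end' => $rangeend,
            // Assumption: the range is executed once its end time has passed.
            'time' => $rangeend,
        ];
    }
    return $ranges;
}

// Example: a period of 400 seconds split into quarters.
$ranges = sketch_define_equal_ranges(0, 400, 4);
```

Each range covers one quarter of the period, so indicators would only need to read the activity logs between that range's 'start' and 'end'.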

A name and description should also be specified:

/**
* Returns a lang_string object representing the name for the time splitting method.
*
* Used as column identificator.
*
* If there is a corresponding '_help' string this will be shown as well.
*
* @return \lang_string
*/
public static function get_name() : \lang_string;

==== Calculable (core_analytics\calculable) ====

This interface is covered last because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base'''. You can still code targets or indicators directly from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

New models can be created and implemented in PHP, and can be packaged as a Moodle local plugin for distribution. Sample model components and models are provided at https://github.com/dmonllao/moodle-local_testanalytics.

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success", this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework (teaching presence, social presence, and cognitive presence). Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own, and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable'''), and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysable elements through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems (imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed).

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour that you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same sample, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to require a single time splitting method at this stage, and they can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Create the target as specified in https://docs.moodle.org/dev/Analytics_API#Target_.28core_analytics.5Clocal.5Ctarget.5Cbase.29.

=== Create the model ===

To add a new model to the system, it must be defined in a PHP file. Normally this is done as part of install.php or upgrade.php for a plugin that contains the new model and components. However, it is also possible to execute the necessary commands in a standalone PHP file that references the Moodle config.php.

To create the model, specify at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
// Note that the 3rd and 4th arguments (the time splitting method and the predictions processor) are optional. The 4th argument is available from Moodle 3.6 onwards.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range', '\mlbackend_python\processor');

Models are disabled by default because you may want to evaluate how good a model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use ''Site administration > Analytics > Analytics models'' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as the analysable element
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has passed the quiz, or has attempted it at least X times (a student who has not passed after X attempts counts as failed)
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, such as weights) and use them later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Tutorial

2018-07-17T07:10:16Z

Dmonllao: /* Maintaining good security */

Welcome to Moodle development!

This is a tutorial to help you learn how to write plugins for Moodle from start to finish, while showing you how to navigate the most important developer documentation along the way.

PRE-REQUISITES: We assume you are fairly comfortable with [[PHP FAQ|PHP]] in general and that you are able to [[:en:Installing AMP|install a database and web server]] on your local machine.

If you need to learn PHP, you can see one PHP tutorial at http://www.w3schools.com/php/default.asp, another at http://php.net/manual/en/tutorial.php and several videos in YouTube at https://www.youtube.com/results?search_query=learn+php.

==Background==
===What's in the box===
If you [http://download.moodle.org/ download] Moodle source code or clone it from [https://github.com/moodle/moodle git], you will see a bunch of files and folders. This code consists of [[Core_APIs|Moodle core]] (that consists of the Very core and Core components), [[Moodle_libraries_credits|third party libraries]] and [[Plugin_types|plugins]]. Their mixed locations can be quite confusing at first but as you start working with it it will become more clear. Moodle developers should avoid modifications of the third party libraries (unless required) and core can never call methods defined in plugins. See also [[Communication Between Components]]

===Setting up your development environment===
* Moodle uses Git for development. View the link below for basic information about Git and Moodle.

links

* [[Git for developers]]

===The Moodle development framework===
===What type of plugin are you developing?===
* Moodle has lots of different types of plugins.
* There are 24 different categories of plugin listed in the Moodle plugins database. Before starting, check there to see whether someone else has already created what you are looking for. Perhaps you could contribute to their plugin instead of creating a new one.

links

* https://moodle.org/plugins/
* [[Plugin types]]

==Let's make a plugin==
===The skeleton of your plugin===

The code of a plugin is organised into multiple different files and directories within a single root directory. All plugins follow the same directory and file structure.

Links

* [[Plugin files]] - Provides a list of required and common plugins files along with their location and purpose.

===Basic page structure===

The link below explains how to create and display a simple page in Moodle.

links

* [[Page API]]

===Multiple languages support===
Moodle has a mechanism for displaying a given text string in multiple languages, depending on the user's preferred language and the site configuration. Plugin authors must define all the text strings used by the plugin in the default English language. These are then used as a reference for translations into other languages.

* [[String API]]
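
As a minimal sketch of what such an English language file might contain, assuming a hypothetical admin tool plugin called ''tool_devcourse'' and made-up string identifiers:

```php
<?php
// File: admin/tool/devcourse/lang/en/tool_devcourse.php (hypothetical plugin).
// Each entry maps a string identifier to its English text.
$string['pluginname'] = 'Development course';
// {$a} is a placeholder that is filled in when the string is displayed.
$string['greeting'] = 'Hello, {$a}!';
```

Inside Moodle, code in the plugin would then retrieve the text with something like ''get_string('greeting', 'tool_devcourse', $username)'', and translators would provide the same identifiers in other language packs.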

===Moodle file structure===
==== Automatic Class Loading ====
Automatic class loading lets class files be included automatically as and when they are required, instead of being included manually every time. Moodle supports automatic class loading. See the link below for an explanation of the rules associated with it -

links

[[ Automatic class loading]]

==== Callbacks ====
You can add a lot of features to your plugin by providing certain callbacks that Moodle expects to be present in your plugin's lib.php file. A detailed list can be found at -

links

[[ Callbacks ]]

==== Plugin types ====
Moodle supports a wide range of plugin types. The complete list and location to place these plugins can be found at -

links

[[ Plugin types ]]

==== Core APIs ====
Moodle provides APIs for a plugin to interact with core and other external systems. For example, you don't have to write SQL queries manually, because Moodle provides its own DDL and DML layers. The link below lists all major core APIs in Moodle -

links

[[ Core APIs ]]

==== Browser accessible pages ====
Any php file in your plugin will either be browser accessible or be an internal file.

For browser accessible pages you must include config.php with code something similar to this -
<code php>
require_once('../../config.php');
</code>

For internal files the code should use the following -
<code php>
defined('MOODLE_INTERNAL') || die();
</code>

===Add basic content to your page===
* Content for a page is added through renderers.
* Renderers are typically stored in the "classes/output" directory.
* Putting content in renderers allows themers to override the visual display of the content.
* Very basic information is presented using the html_writer class.

links

* [[Renderer]]

=== Content with templates ===
* In most cases when displaying content, templates should be used.
* Templates are stored in the "templates" directory. The templates use Mustache files.
* Mustache files allow for more generic html with placeholders inserted, that inserts the data (context) at run time.

links

* [[Templates]]

===Adding your plugin into Moodle's navigation===
* The Moodle navigation system has hooks which allow plugins to add links to the navigation menu.
* Hooks are located in lib.php. Try to keep lib.php as small as possible as this file is included on every page. Put classes and functions elsewhere.
* Course navigation extension example:
<code php>
function tool_devcourse_extend_navigation_course($navigation, $course, $coursecontext) {
    $url = new moodle_url('/admin/tool/devcourse/index.php');
    $devcoursenode = navigation_node::create('Development course', $url, navigation_node::TYPE_CUSTOM, 'Dev course', 'devcourse');
    $navigation->add_node($devcoursenode);
}
</code>

links

* [[Navigation API]]
* [[:en:Navigation]]

===Database queries===
* Moodle has a generic database query library. Behind this library are additional libraries which allow Moodle to work with MySQL, PostgreSQL, Oracle, SQL Server, and MariaDB.
* Where possible it is advisable to use the predefined functions rather than write out SQL. Writing SQL has a greater chance of not working with one of the supported databases.
* The return of the select functions tends to be an object or an array of objects.

Example call to retrieve data from the course table.
<code php>
global $DB;
$courses = $DB->get_records('course', null, '', 'id, category, fullname, shortname');
</code>
Example of data returned from the above code.
<code>
Array
(
[43] => stdClass Object
(
[id] => 43
[category] => 1
[fullname] => XX -Test 5
[shortname] => xxt5
)

[5] => stdClass Object
(
[id] => 5
[category] => 1
[fullname] => With the glossary
[shortname] => wtg
)

[39] => stdClass Object
(
[id] => 39
[category] => 1
[fullname] => XX - Test 1
[shortname] => xxt1
)
)
</code>

links

* [[Database]]
* [[Data manipulation API]]

===Creating your own database tables===
* We create our database tables in Moodle using the XMLDB editor. This is located in the administration block "Site administration | Development | XMLDB editor".
* The {plugin}/db directory needs to be writable for the XMLDB editor to be most effective.
* The XMLDB editor creates an install.xml file in the db directory. This file will be loaded during the install to create your tables.
* The XMLDB editor can also generate the PHP upgrade code for adding and updating Moodle database tables.

The XMLDB main page.

[[{{ns:file}}:xmldb-main.png]]

Upgrade code generated by the XMLDB

[[{{ns:file}}:xmldb-upgrade-code.png]]

links
* [[XMLDB editor]]
* [[Using XMLDB]]
* [[XMLDB Documentation]]
* [[Upgrade API]]
* [[XMLDB introduction]]

===Supporting access permissions: roles, capabilities and contexts===
* Capabilities are controlled in "access.php" under the "db" directory.
* This file has an array of capabilities with the following:
** name
** possible security risks behind giving this capability.
** The context that this capability works in.
** The default roles (teacher, manager, student, etc) that have this capability.
** Various other information.
* These capabilities are checked in code to allow access to pages, sections, and abilities (saving, deleting, etc).
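
The following is a sketch of how such a capability definition might look, assuming a hypothetical plugin ''tool_devcourse'' and a made-up capability name. It is a fragment meant to be loaded by Moodle, which defines the RISK_*, CONTEXT_* and CAP_* constants; it is not a standalone script.

```php
<?php
// File: admin/tool/devcourse/db/access.php (hypothetical plugin and capability name).
// Loaded by Moodle, which defines the RISK_*, CONTEXT_* and CAP_* constants used here.
$capabilities = [
    'tool/devcourse:view' => [
        // Possible security risk behind giving this capability.
        'riskbitmask' => RISK_PERSONAL,
        'captype' => 'read',
        // The context that this capability works in.
        'contextlevel' => CONTEXT_COURSE,
        // The default roles that have this capability.
        'archetypes' => [
            'editingteacher' => CAP_ALLOW,
            'manager' => CAP_ALLOW,
        ],
    ],
];
```

A page in the plugin could then guard access with a call such as ''require_capability('tool/devcourse:view', $context)''.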

links
* [[Access API]]
* [[Roles#Context]]
* [[:en:Category:Capabilities]]
* [[:en:Roles and permissions]]

===Adding web forms===
* Moodle has its own forms library.
* The forms lib includes a lot of accessibility code, and error checking, by default.
* Moodle forms can be displayed in JavaScript using 'fragments'.

links

* [[Form API]]
* [[lib/formslib.php Form Definition]]
* [[Fragment]]

===Maintaining good security===
* Use the sesskey when directing to pages to do actions.
* Use the appropriate filters when retrieving parameters.

links

* [[Security]]
* [[lib/formslib.php Form Definition#Most Commonly Used PARAM .2A Types]]
* [[Output functions#p.28.29 and s.28.29]]

===Handling files===
* Files are conceptually stored in file areas.
* A plugin can only access files from its own component.

links

* [[File API]]
* [[File API internals]]
* [[Using the File API in Moodle forms]]

===Adding Javascript===
* Moodle is currently using jQuery and AMD (Asynchronous Module Definition).
* JavaScript files are located in the "amd/src" directory.
* Use grunt to build your JavaScript.
* Include your JavaScript in php files as follows:
<code php>
$this->page->requires->js_call_amd('{JScriptfilename}');
</code>
* JavaScript can also be included in Mustache templates.

links

* [[Javascript Modules]]
* [[jQuery]]
* [[Javascript FAQ]]
* [[JavaScript guidelines]]
* [[Grunt]]

===Adding events and logging===
* All logging in Moodle is done through the events system.
* New events should be located in the "classes/event" directory.
* It is possible to create observers and subscribe to events.

links

* [[Event 2]]
* [[Logging 2]]

===Accessibility===

Accessibility is an important consideration when developing a plugin, to make sure your plugin is accessible to all users and doesn't discriminate against users with disabilities. It is also a mandated requirement in many countries. The links below explain common practices that we follow at Moodle to make the interface more accessible.

links
* [[Accessibility]]
* [[Usability]]
===Web Services and AJAX===
* Moodle web services use the external functions API.
* The recommended way to make AJAX requests is to use the ajax AMD file which uses the external functions API.
* External functions should be located in the "classes/external.php" file.
* A list of services should be included in "db/services.php". This file is required to register the web services with Moodle.
* The services list is an array which contains:
** '''classname''' Name of the external class.
** '''methodname''' The name of the external function.
** '''classpath''' system path to the external function file.
** '''description''' Description of the function
** '''type''' Whether the function reads or writes data ('read' or 'write').
** '''ajax''' Can this function be used with ajax?
** '''capabilities''' Capabilities required to use this function.
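
As an illustration of the array described above, here is a minimal sketch of a "db/services.php" file. The plugin name ''tool_devcourse'', the function name, the class name and the capability are all hypothetical, and leaving '''classpath''' empty assumes that the external class can be autoloaded.

```php
<?php
// File: admin/tool/devcourse/db/services.php (hypothetical plugin and function names).
$functions = [
    'tool_devcourse_get_entries' => [
        'classname' => 'tool_devcourse_external',
        'methodname' => 'get_entries',
        // Left empty on the assumption that the class is autoloaded.
        'classpath' => '',
        'description' => 'Returns the entries of a development course.',
        'type' => 'read',
        // Allow this function to be called through AJAX.
        'ajax' => true,
        'capabilities' => 'tool/devcourse:view',
    ],
];
```

The array key is the name under which the function is registered, so it is the name a client would use when calling the web service.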

links

* [[External functions API]]
* [[Adding a web service to a plugin]]
* [[Web services API]]

===Using caching to improve performance===
* The main cache used by Moodle is the Moodle Universal Cache (MUC).
* The MUC supports several types of cache -
** Request cache
** Session cache
** Application cache

links

* [[The Moodle Universal Cache (MUC)]]
* [[:en:MUC FAQ]]
* [[:en:Caching]]

===Supporting backup and restore===
* Supporting backup and restore requires creating several files in the 'backup/moodle2' directory.
* Backup requires a class to extend the backup_task of some sort. There may be a specific plugin task to extend.
* Restore requires a class to extend the restore_task of some sort. There may be a specific plugin task to extend.
* The restore steps lib defines the structure of the plugin to be restored.
* The backup steps lib defines steps, settings, attributes, etc.

links

* [[Backup 2.0 for developers]] - provides an example of step-by-step implementation of backup support to your plugin
* [[:en:Backup and restore FAQ]]

===Supporting automated testing===
* Moodle has two types of automated testing: PHPUnit tests and Behat tests.
* Unit tests are for testing functions.
* Behat tests run through scenarios.
** Behat tests follow a script and navigate through Moodle pages.
* Unit tests should be located in the "tests" directory.
* Behat tests should be located in the "tests/behat" directory.
* Tests located in these directories will be run when a full test run is initiated.
* Behat tests are actually feature files and end with the extension ".feature".

links

* [[Writing PHPUnit tests]]
* [[PHPUnit]]
* [[Acceptance testing]]
* [[Behat integration]]

==Publishing your plugin==
===Adding your plugin to moodle.org===
* Publish your plugins at https://moodle.org/plugins/
* Publishing plugins on the moodle site leads you through a bunch of steps that need to be completed in order for the plugin to be approved and published.
* Plugins will be run through a pre-checker to give suggestions about possible issues with the code.

===Supporting your plugin===

==TODO==
Finishing this tutorial:
# About one or two screens for each section with a very generic overview for beginners, containing links to relevant docs WITH COMMENTS ABOUT QUALITY, USEFULNESS, CAVEATS etc.
# Go through all the linked pages and make sure they are current and accurate.
# Add a worked example to this page, so that each section has suggestions about things to add to the admin tool being built as an exercise. If the code is long, it could be placed on separate pages. A good reference for style is [[Moodle_Mobile_Developing_a_plugin_tutorial]] and [[Blocks]].

==See also==
* See [https://moodle.org/mod/forum/discuss.php?d=355789 this (july 2017) forum thread ] about Getting Started with Moodle Development.
* See [https://moodle.org/mod/forum/discuss.php?d=352360 this (may 2017) forum thread] about Getting Started with Moodle Development.

See also these older tutorials:

* [[Blocks|A Step-by-step Guide To Creating Blocks]]
* [[NEWMODULE Tutorial]]
* [[Moodle_Mobile_Developing_a_plugin_tutorial|Moodle Mobile plugin tutorial]]
* [[Category:Tutorial|Other tutorials in these docs]]

==Any questions?==

If you have any questions, you're welcome to post in the [https://moodle.org/mod/forum/view.php?id=55 General developer forum] on moodle.org

[[Category:Tutorial]]

Tutorial

2018-07-17T07:09:49Z

Dmonllao: /* Maintaining good security */

Welcome to Moodle development!

This is a tutorial to help you learn how to write plugins for Moodle from start to finish, while showing you how to navigate the most important developer documentation along the way.

PRE-REQUISITES: We assume you are fairly comfortable with [[PHP FAQ|PHP]] in general and that you are able to [[:en:Installing AMP|install a database and web server]] on your local machine.

If you need to learn PHP, you can see one PHP tutorial at http://www.w3schools.com/php/default.asp, another at http://php.net/manual/en/tutorial.php and several videos in YouTube at https://www.youtube.com/results?search_query=learn+php.

==Background==
===What's in the box===
If you [http://download.moodle.org/ download] Moodle source code or clone it from [https://github.com/moodle/moodle git], you will see a bunch of files and folders. This code consists of [[Core_APIs|Moodle core]] (that consists of the Very core and Core components), [[Moodle_libraries_credits|third party libraries]] and [[Plugin_types|plugins]]. Their mixed locations can be quite confusing at first but as you start working with it it will become more clear. Moodle developers should avoid modifications of the third party libraries (unless required) and core can never call methods defined in plugins. See also [[Communication Between Components]]

===Setting up your development environment===
* Moodle uses Git for development. View the link below for basic information about Git and Moodle.

links

* [[Git for developers]]

===The Moodle development framework===
===What type of plugin are you developing?===
* Moodle has lots of different types of plugins.
* There are 24 different categories of plugin listed in the Moodle plugins database. Before starting, check there to see whether someone else has already created what you are looking for. Perhaps you could contribute to their plugin instead of creating a new one.

links

* https://moodle.org/plugins/
* [[Plugin types]]

==Let's make a plugin==
===The skeleton of your plugin===

The code of a plugin is organised into multiple different files and directories within a single root directory. All plugins follow the same directory and file structure.

Links

* [[Plugin files]] - Provides a list of required and common plugins files along with their location and purpose.

===Basic page structure===

The link below explains how to create and display a simple page in Moodle.

links

* [[Page API]]

===Multiple languages support===
Moodle has a mechanism for displaying a given text string in multiple languages, depending on the user's preferred language and the site configuration. Plugin authors must define all the text strings used by the plugin in the default English language. These are then used as a reference for translations into other languages.

* [[String API]]

===Moodle file structure===
==== Automatic Class Loading ====
Automatic class loading lets class files be included automatically as and when they are required, instead of being included manually every time. Moodle supports automatic class loading. See the link below for an explanation of the rules associated with it -

links

[[ Automatic class loading]]

==== Callbacks ====
You can add a lot of features to your plugin by providing certain callbacks that Moodle expects to be present in your plugin's lib.php file. A detailed list can be found at -

links

[[ Callbacks ]]

==== Plugin types ====
Moodle supports a wide range of plugin types. The complete list and location to place these plugins can be found at -

links

[[ Plugin types ]]

==== Core APIs ====
Moodle provides APIs for a plugin to interact with core and other external systems. For example, you don't have to write SQL queries manually, because Moodle provides its own DDL and DML layers. The link below lists all major core APIs in Moodle -

links

[[ Core APIs ]]

==== Browser accessible pages ====
Any php file in your plugin will either be browser accessible or be an internal file.

For browser accessible pages you must include config.php with code something similar to this -
<code php>
require_once('../../config.php');
</code>

For internal files the code should use the following -
<code php>
defined('MOODLE_INTERNAL') || die();
</code>

===Add basic content to your page===
* Content for a page is added through renderers.
* Renderers are typically stored in the "classes/output" directory.
* Putting content in renderers allows themers to override the visual display of the content.
* Very basic information is presented using the html_writer class.

links

* [[Renderer]]

=== Content with templates ===
* In most cases when displaying content, templates should be used.
* Templates are stored in the "templates" directory. The templates use Mustache files.
* Mustache files allow for more generic html with placeholders inserted, that inserts the data (context) at run time.

links

* [[Templates]]

===Adding your plugin into Moodle's navigation===
* The Moodle navigation system has hooks which allow plugins to add links to the navigation menu.
* Hooks are located in lib.php. Try to keep lib.php as small as possible as this file is included on every page. Put classes and functions elsewhere.
* Course navigation extension example:
<code php>
function tool_devcourse_extend_navigation_course($navigation, $course, $coursecontext) {
    $url = new moodle_url('/admin/tool/devcourse/index.php');
    $devcoursenode = navigation_node::create('Development course', $url, navigation_node::TYPE_CUSTOM, 'Dev course', 'devcourse');
    $navigation->add_node($devcoursenode);
}
</code>

links

* [[Navigation API]]
* [[:en:Navigation]]

===Database queries===
* Moodle has a generic database query library. Behind this library are additional libraries which allow Moodle to work with MySQL, PostgreSQL, Oracle, SQL Server, and MariaDB.
* Where possible it is advisable to use the predefined functions rather than write out SQL. Writing SQL has a greater chance of not working with one of the supported databases.
* The return of the select functions tends to be an object or an array of objects.

Example call to retrieve data from the course table.
<code php>
global $DB;
$courses = $DB->get_records('course', null, '', 'id, category, fullname, shortname');
</code>
Example of data returned from the above code.
<code>
Array
(
[43] => stdClass Object
(
[id] => 43
[category] => 1
[fullname] => XX -Test 5
[shortname] => xxt5
)

[5] => stdClass Object
(
[id] => 5
[category] => 1
[fullname] => With the glossary
[shortname] => wtg
)

[39] => stdClass Object
(
[id] => 39
[category] => 1
[fullname] => XX - Test 1
[shortname] => xxt1
)
)
</code>

links

* [[Database]]
* [[Data manipulation API]]

===Creating your own database tables===
* We create our database tables in Moodle using the XMLDB editor. This is located in the administration block "Site administration | Development | XMLDB editor".
* The {plugin}/db directory needs to be writable by the web server so that the XMLDB editor can save its files.
* The XMLDB editor creates an install.xml file in the db directory. This file is loaded during installation to create your tables.
* The XMLDB editor also generates the PHP upgrade code for adding and updating Moodle database tables.

The XMLDB main page.

[[{{ns:file}}:xmldb-main.png]]

Upgrade code generated by the XMLDB editor.

[[{{ns:file}}:xmldb-upgrade-code.png]]

links
* [[XMLDB editor]]
* [[Using XMLDB]]
* [[XMLDB Documentation]]
* [[Upgrade API]]
* [[XMLDB introduction]]

===Supporting access permissions: roles, capabilities and contexts===
* Capabilities are controlled in "access.php" under the "db" directory.
* This file contains an array of capabilities. Each capability defines:
** Its name.
** The possible security risks of granting it.
** The context level in which it applies.
** The default roles (teacher, manager, student, etc.) that have it.
** Various other information.
* These capabilities are checked in code to allow access to pages, sections, and abilities (saving, deleting, etc).

links
* [[Access API]]
* [[Roles#Context]]
* [[:en:Category:Capabilities]]
* [[:en:Roles and permissions]]

===Adding web forms===
* Moodle has its own forms library.
* The forms lib includes a lot of accessibility code, and error checking, by default.
* Moodle forms can be displayed in JavaScript using 'fragments'.

links

* [[Form API]]
* [[lib/formslib.php Form Definition]]
* [[Fragment]]

===Maintaining good security===
* Require the sesskey on any page that performs an action.
* Use the appropriate PARAM_* filter when retrieving parameters.

links

* [https://docs.moodle.org/dev/Security Security docs]
* [[lib/formslib.php Form Definition#Most Commonly Used PARAM .2A Types]]
* [[Output functions#p.28.29 and s.28.29]]

===Handling files===
* Files are conceptually stored in file areas.
* A plugin can only access files from its own component.

links

* [[File API]]
* [[File API internals]]
* [[Using the File API in Moodle forms]]

===Adding Javascript===
* Moodle currently uses jQuery and AMD (Asynchronous Module Definition) modules.
* JavaScript files are located in the "amd/src" directory.
* Use grunt to build your JavaScript.
* Include your JavaScript in PHP files as follows:
<code php>
$this->page->requires->js_call_amd('{component}/{modulename}', '{functionname}');
</code>
* JavaScript can also be included in Mustache templates.

links

* [[Javascript Modules]]
* [[jQuery]]
* [[Javascript FAQ]]
* [[JavaScript guidelines]]
* [[Grunt]]

===Adding events and logging===
* All logging in Moodle is done through the events system.
* New events should be located in the "classes/event" directory.
* It is possible to create observers and subscribe to events.

links

* [[Event 2]]
* [[Logging 2]]

===Accessibility===

Accessibility is an important consideration when developing a plugin: your plugin should be usable by everyone and must not discriminate against users with disabilities. In many countries this is a legal requirement. The links below explain common practices that we follow at Moodle to make the interface more accessible.

links
* [[Accessibility]]
* [[Usability]]

===Web Services and AJAX===
* Moodle web services use the external functions API.
* The recommended way to make AJAX requests is to use the ajax AMD file which uses the external functions API.
* External functions should be located in the "classes/external.php" file.
* A list of services should be included in "db/services.php". This file is required to register the web services with Moodle.
* The services list is an array which contains:
** '''classname''' Name of the external class.
** '''methodname''' The name of the external function.
** '''classpath''' System path to the external function file.
** '''description''' Description of the function.
** '''type''' Whether the function reads data ('read') or changes it ('write').
** '''ajax''' Can this function be used with ajax?
** '''capabilities''' Capabilities required to use this function.
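
On the JavaScript side, a function registered in db/services.php with ajax enabled is typically called through the core/ajax module. The sketch below is a hedged illustration: the web service name tool_devcourse_get_items and its courseid argument are hypothetical, and the ajax module is passed in as a parameter (with a small stand-in object) so that the snippet can run outside Moodle. Inside a real AMD module you would list 'core/ajax' as a dependency instead.

```javascript
// Build and send one web service request through an ajax module that
// exposes call(requests), as Moodle's core/ajax module does.
function fetchItems(ajax, courseid) {
    var promises = ajax.call([{
        methodname: 'tool_devcourse_get_items', // Hypothetical function name.
        args: {courseid: courseid}              // Hypothetical argument.
    }]);
    // call() returns one promise per request, in the same order.
    return promises[0];
}

// Stand-in for core/ajax, only so the sketch runs outside Moodle.
var sentRequests = [];
var fakeAjax = {
    call: function(requests) {
        sentRequests = requests;
        return requests.map(function(request) {
            return Promise.resolve({echo: request.methodname});
        });
    }
};

fetchItems(fakeAjax, 5).then(function(response) {
    console.log(response.echo);
});
```

The methodname must match the key used for the function in db/services.php.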

links

* [[External functions API]]
* [[Adding a web service to a plugin]]
* [[Web services API]]

===Using caching to improve performance===
* The main cache used by Moodle is the Moodle Universal Cache (MUC).
* A MUC cache definition uses one of three modes:
** Request cache
** Session cache
** Application cache

links

* [[The Moodle Universal Cache (MUC)]]
* [[:en:MUC FAQ]]
* [[:en:Caching]]

===Supporting backup and restore===
* Supporting backup and restore requires creating several files in the 'backup/moodle2' directory.
* Backup requires a class to extend the backup_task of some sort. There may be a specific plugin task to extend.
* Restore requires a class to extend the restore_task of some sort. There may be a specific plugin task to extend.
* The restore steps lib defines the structure of the plugin to be restored.
* The backup steps lib defines steps, settings, attributes, etc.

links

* [[Backup 2.0 for developers]] - provides an example of step-by-step implementation of backup support to your plugin
* [[:en:Backup and restore FAQ]]

===Supporting automated testing===
* Moodle has two types of automated testing: PHPUnit tests and Behat tests.
* Unit tests are for testing functions.
* Behat tests run through scenarios.
** A Behat test follows a script and navigates through Moodle pages.
* Unit tests should be located in the "tests" directory.
* Behat tests should be located in the "tests/behat" directory.
* Tests located in these directories are run when a full test run is initiated.
* Behat tests are actually feature files and end with the extension ".feature".

links

* [[Writing PHPUnit tests]]
* [[PHPUnit]]
* [[Acceptance testing]]
* [[Behat integration]]

==Publishing your plugin==
===Adding your plugin to moodle.org===
* Publish your plugins at https://moodle.org/plugins/
* Publishing a plugin on the Moodle site leads you through a series of steps that need to be completed before the plugin is approved and published.
* Plugins will be run through a pre-checker to give suggestions about possible issues with the code.

===Supporting your plugin===

==TODO==
Finishing this tutorial:
# About one or two screens for each section with a very generic overview for beginners, containing links to relevant docs WITH COMMENTS ABOUT QUALITY, USEFULNESS, CAVEATS etc.
# Go through all the linked pages and make sure they are current and accurate.
# Add a worked example to this page, so that each section has suggestions about things to add to the admin tool being built as an exercise. If the code is long, it could be placed on separate pages. A good reference for style is [[Moodle_Mobile_Developing_a_plugin_tutorial]] and [[Blocks]].

==See also==
* See [https://moodle.org/mod/forum/discuss.php?d=355789 this (July 2017) forum thread] about Getting Started with Moodle Development.
* See [https://moodle.org/mod/forum/discuss.php?d=352360 this (May 2017) forum thread] about Getting Started with Moodle Development.

See also these older tutorials:

* [[Blocks|A Step-by-step Guide To Creating Blocks]]
* [[NEWMODULE Tutorial]]
* [[Moodle_Mobile_Developing_a_plugin_tutorial|Moodle Mobile plugin tutorial]]
* [[:Category:Tutorial|Other tutorials in these docs]]

==Any questions?==

If you have any questions, you're welcome to post in the [https://moodle.org/mod/forum/view.php?id=55 General developer forum] on moodle.org.

[[Category:Tutorial]]

Javascript Modules

2018-05-21T10:00:43Z

Dmonllao: /* Install grunt */

{{Moodle 2.9}}

= Javascript Modules =

== What is a Javascript module and why do I care? ==

A Javascript module is nothing more than a collection of Javascript code that can be used (reliably) from other pieces of Javascript.

== Why should I package my code as a module? ==

By packaging your code as a module you break your code up into smaller reusable pieces. This is good because:

a) Each smaller piece is simpler to understand / debug

b) Each smaller piece is simpler to test

c) You can re-use common code instead of duplicating it

= How do I write a Javascript module in Moodle? =

Since version 2.9, Moodle supports Javascript modules written using the Asynchronous Module Definition ([https://github.com/amdjs/amdjs-api/wiki/AMD AMD]) API. This is a standard API for creating Javascript modules and you will find many useful third party libraries that are already using this format.

To edit or create an AMD module in Moodle you need to do a couple of things.

== Install grunt ==

The AMD modules in Moodle must be processed by some build tools before they will be visible to your web browser. We use "[[grunt]]" as a build tool to wrap our different processes. Grunt is a build tool written in Javascript that runs in the "[http://nodejs.org/ nodejs]" environment.

This means you first have to '''install nodejs''' - and its package manager [https://www.npmjs.com/ npm]. The details of how to install those packages will vary by operating system, but on Linux it's probably similar to "sudo apt-get install nodejs npm". There are downloadable packages for other operating systems here: http://nodejs.org/download/. Moodle currently requires node "lts/carbon".

Once this is done, you can '''run the commands''':

npm install
npm install -g grunt-cli

from the top of the Moodle directory to install all of the required tools. (You may need extra permissions to use the -g option.)

== Development mode ==

To avoid having to constantly run grunt, make sure you set the following in your config.php:

<code php>
// Prevent JS caching
$CFG->cachejs = false;
</code>

Moodle will now run your module from the amd/src directory. Don't forget to switch this off and run 'grunt' before deploying the new version!

In this mode, a strange message in your JavaScript console such as "No define call for core/first" means you have a syntax error in the JavaScript you are developing.
A message such as "No define call for theme_XXX/loader" probably means that the 'src' folder with the relevant JS files is missing. This can happen when you turn debugging on for a purchased theme that was shipped without its 'src' folder.

== Running grunt ==

You can run grunt in your plugin's 'amd' directory and it will only operate on your modules. If you're having problems or just want to check your work, it is worth running grunt for its 'lint' feature, which can find basic problems. Unfortunately this sub-directory support won't work on Windows, but there is an alternative: run grunt from the top directory with the --root=path/to/dir option to limit execution to a sub-directory.

See [[Grunt#Running_grunt]] for more details of specific grunt commands which can be used.

If you get the error message

/usr/bin/env: node: No such file or directory

Then see the thread https://github.com/nodejs/node-v0.x-archive/issues/3911

On Ubuntu 14.04 the following fixed it:

sudo ln -fs /usr/bin/nodejs /usr/local/bin/node

Note: Once you have run grunt and built your code, you will then need to purge Moodle caches otherwise the changes made to your minified files may not be picked up by Moodle.

== Minimum (getting started) module for plugins ==

This shows the absolute minimum module you need to get started adding modules to your plugins. It's actually quite simple...

<code javascript>
// Put this file in path/to/plugin/amd/src
// You can call it anything you like

define(['jquery'], function($) {

    return {
        init: function() {

            // Put whatever you like here. $ is available
            // to you as normal.
            $(".someclass").change(function() {
                alert("It changed!!");
            });
        }
    };
});
</code>

This code passes the jquery module into our function (parameter $). There are a number of other useful modules available in Moodle, some of which you'll probably need in a practical application. See [[Useful_core_Javascript_modules]]. Simply list them in both the define() first parameter and the function callback. E.g.,
<code javascript>
define(['jquery', 'core/str', 'core/ajax'], function($, str, ajax) {
</code>

The idea here is that we will run the 'init' function from our (PHP) code to set things up. This is called from PHP like this...

<code php>
$PAGE->requires->js_call_amd('frankenstyle_path/your_js_filename', 'init');
</code>

Don't forget to supply the complete '[[Frankenstyle]]' path. The .js is not needed.

js_call_amd takes a third parameter which is an ''array'' of parameters. These will translate to individual parameters in the 'init' function call. For example...

<code php>
$PAGE->requires->js_call_amd('block_iomad_company_admin/department_select', 'init', array($first, $last));
</code>

...calls

<code javascript>
return {
    init: function(first, last) {
    }
};
</code>

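The mapping from the PHP array to individual function arguments can be mimicked in plain JavaScript. This is a sketch of the idea only, not the code Moodle actually generates: the departmentSelect object stands in for the value returned by an AMD module, and the sample values are made up.

```javascript
// Stand-in for the object an AMD module returns (hypothetical module).
var departmentSelect = {
    init: function(first, last) {
        // Report what was received so the mapping is visible.
        return 'first=' + first + ', last=' + last;
    }
};

// The third js_call_amd() argument, as it would look once JSON encoded.
var encodedParams = '["Ada", "Lovelace"]';

// Each array element becomes one positional argument of init().
var params = JSON.parse(encodedParams);
var result = departmentSelect.init.apply(departmentSelect, params);

console.log(result);
```

Because the parameters travel as JSON, only simple values survive the trip from PHP to JavaScript.
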
A more comprehensive explanation follows...

== "Hello World" I am a Javascript Module ==
Let's now create a simple Javascript module so we can see how to lay things out.

Each Javascript module is contained in a single source file in the <componentdir>/amd/src folder. The final name of the module is taken from the file name and the component name. E.g. block_overview/amd/src/helloworld.js would be a module named "block_overview/helloworld". The name of the module is important when you want to call it from somewhere else in the code.
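
The naming rule can be written down as a small function. This is only an illustration of the convention described above: amdModuleName is a hypothetical helper, not part of Moodle, and it expects the component name to be supplied by the caller.

```javascript
// Hypothetical helper: derive the AMD module name from the component
// name and the path of a source file under amd/src.
function amdModuleName(component, sourcePath) {
    var match = /amd\/src\/(.+)\.js$/.exec(sourcePath);
    if (!match) {
        throw new Error('Not an AMD source file: ' + sourcePath);
    }
    // Module name = component name + '/' + file name without ".js".
    return component + '/' + match[1];
}

var moduleName = amdModuleName('block_overview', 'blocks/overview/amd/src/helloworld.js');
console.log(moduleName);
```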

After running grunt - the minified Javascript files are stored in the <componentdir>/amd/build folder. The javascript files are renamed to show that they are minified (helloworld.js becomes helloworld.min.js).

Don't forget to add the built files (the ones in amd/build) to your git commits, or in production no-one will see your changes.

Let's create a simple module now:

blocks/overview/amd/src/helloworld.js
<code javascript>
// Standard license block omitted.
/*
 * @package    block_overview
 * @copyright  2015 Someone cool
 * @license    http://www.gnu.org/copyleft/gpl.html GNU GPL v3 or later
 */

/**
 * @module block_overview/helloworld
 */
define(['jquery'], function($) {

    /**
     * Give me blue.
     * @access private
     * @return {jQuery} The matched elements, after showing them.
     */
    var makeItBlue = function() {
        // We can use our jquery dependency here.
        return $('.blue').show();
    };

    /**
     * @constructor
     * @alias module:block_overview/helloworld
     */
    var greeting = function() {
        /** @access private */
        var privateThoughts = 'I like the colour blue';

        /** @access public */
        this.publicThoughts = 'I like the colour orange';

    };

    /**
     * A formal greeting.
     * @access public
     * @return {string}
     */
    greeting.prototype.formal = function() {
        return 'How do you do?';
    };

    /**
     * An informal greeting.
     * @access public
     * @return {string}
     */
    greeting.prototype.informal = function() {
        return 'Wassup!';
    };
    return greeting;
});
</code>

The most interesting line above is:
<code>
define(['jquery'], function($) {
</code>

All AMD modules must call "define()" as the first and only global scoped piece of code. This ensures the javascript code contains no global variables and will not conflict with any other loaded module. The name of the module does not need to be specified because it is determined from the filename and component (but it can be listed in a comment for JSDoc as shown here).

The first argument to "define" is the list of dependencies for the module. This argument must be passed as an array, even if there is only one dependency. In this example "jquery" is a dependency. "jquery" is shipped as a core module and is available to all AMD modules.

The second argument to "define" is the function that defines the module. This function receives each of the requested dependencies as arguments, in the same order they were requested. In this example we receive jQuery as an argument and we name the variable "$" (it's a jQuery thing). We can then access jQuery normally through the $ variable, which is in scope for any code in our module.

The rest of the code in this example is a standard way to define a Javascript module with public/private variables and methods. There are many ways to do this; this is only one.
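
To see what "public" and "private" mean for a caller, the constructor can be tried on its own. The sketch below repeats a shortened copy of the constructor outside define() purely so that it runs without Moodle; the Greeting and someone variable names are chosen for illustration.

```javascript
// Shortened copy of the constructor from the module above.
var Greeting = function() {
    // Local variable: only visible inside this function.
    var privateThoughts = 'I like the colour blue';

    // Property on the instance: visible to callers.
    this.publicThoughts = 'I like the colour orange';
};

// Methods on the prototype are shared by all instances and are public.
Greeting.prototype.formal = function() {
    return 'How do you do?';
};

var someone = new Greeting();
console.log(someone.formal());
console.log(someone.publicThoughts);
console.log(typeof someone.privateThoughts); // The local variable is not a property.
```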

It is important that we are returning 'greeting'. If there is no return then your module will be declared as undefined.
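
Both points, the argument order and the importance of the return value, can be demonstrated with a toy loader. This is a teaching sketch and not RequireJS: toyDefine and the registry object are invented for this example, and the registered "modules" are plain placeholder objects.

```javascript
// Toy registry standing in for modules the real loader already knows about.
var registry = {
    'jquery': {name: 'fake jQuery'},
    'core/str': {name: 'fake core/str'}
};

// Toy define(): resolves each dependency name and passes the results to
// the factory function in the same order as the dependency array.
function toyDefine(dependencies, factory) {
    var resolved = dependencies.map(function(name) {
        if (!(name in registry)) {
            throw new Error('Unknown module: ' + name);
        }
        return registry[name];
    });
    // Whatever the factory returns IS the module.
    return factory.apply(null, resolved);
}

var mod = toyDefine(['jquery', 'core/str'], function($, str) {
    return {
        describe: function() {
            return $.name + ' + ' + str.name;
        }
    };
});
console.log(mod.describe());

// A factory without a return statement produces an undefined module.
var noReturn = toyDefine(['jquery'], function($) {
    $.name.toUpperCase();
});
console.log(typeof noReturn);
```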

== Loading modules dynamically ==
What do you do if you don't know in advance which modules will be required? Stuffing all possible required modules in the define call is one solution, but it's ugly and it only works for code that is in an AMD module (what about inline code in the page?). AMD lets you load a dependency any time you like.
<code javascript>

// Load a new dependency.
require(['mod_wiki/timer'], function(timer) {
// timer is available to do my bidding.
});
</code>

== Including an external javascript/jquery library ==
If you want to include a javascript / jquery library downloaded from the internet you can do so as follows:

'''Warning: if the library you download supports AMD but is already "named", you will not be able to include it directly.'''
For example:
<code javascript>
define("typeahead.js", [ "jquery" ], function(a0) {
    return factory(a0);
});
</code>
will not work, as Moodle injects its own define name when loading the library.

If the library is in AMD format and has a define, it can be included as follows. For example, to include the jQuery countdown timer ( hilios.github.io/jQuery.countdown/ ) on a page:
* Download the module in both normal and minified versions.
* Place the modules in your Moodle install, e.g. in your custom theme directory or plugin directory.
* /theme/mytheme/amd/src/jquery.countdown.js

You can now include the module and initialise it. There are multiple ways to do this:

1. Create your own AMD module and initialise it:

In your PHP file:
<code php>
$this->page->requires->js_call_amd('theme_mytheme/countdowntimer', 'initialise', $params);
</code>

Javascript module:
<code javascript>
// /theme/mytheme/amd/src/countdowntimer.js
define(['jquery', 'theme_mytheme/jquery.countdown'], function($, c) {
return {
initialise: function ($params) {
$('#clock').countdown('2020/10/10', function(event) {
$(this).html(event.strftime('%D days %H:%M:%S'));
});
}
};
});
</code>

2. Put the javascript into a mustache template:
<code javascript>
// /theme/mytheme/templates/countdowntimer.mustache
<span id="clock"></span>
{{#js}}
require(['jquery', 'theme_mytheme/jquery.countdown'], function($) {
$('#clock').countdown('2020/10/10', function(event) {
$(this).html(event.strftime('%D days %H:%M:%S'));
});
});
{{/js}}
</code>

3. Call the JavaScript directly from PHP (although who would want to put JavaScript into PHP?):
<code php>
$PAGE->requires->js_amd_inline('
    require(["jquery", "theme_mytheme/jquery.countdown"], function($) {
        $("#clock").countdown("2020/10/10", function(event) {
            $(this).html(event.strftime("%D days %H:%M:%S"));
        });
    });
');
</code>

== Embedding AMD code in a page ==
So you have created lots of cool Javascript modules. Great. How do we actually call them? Any javascript code that calls an AMD module must execute AFTER the requirejs module loader has finished loading. We have provided a function "js_call_amd" that will call a single function from an AMD module with parameters.

<code php>
$PAGE->requires->js_call_amd($modulename, $functionname, $params);
</code>

This will "do the right thing" with your block of AMD code and execute it at the end of the page, after our AMD module loader has loaded.
Notes:
* the $modulename is the 'componentname/modulename' discussed above
* the $functionname is the name of a public function exposed by the amd module.
* the $params is an array of params passed as arguments to the function. These should be simple types that can be handled by json_encode (no recursive arrays, or complex classes please).
* if the size of the params array is too large (> 1Kb), this will produce a developer warning. Do not attempt to pass large amounts of data through this function, it will pollute the page size. A preferred approach is to pass css selectors for DOM elements that contain data-attributes for any required data, or fetch data via ajax in the background.
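
The difference between passing data and passing a selector can be made concrete by measuring the JSON-encoded parameters. The sketch below is an approximation written for illustration: encodedSize is a hypothetical helper that counts characters of the JSON string, the 1024 limit is the figure quoted above, and the sample data and the #course-list selector are made up.

```javascript
// Threshold quoted in the notes above (1Kb).
var LIMIT = 1024;

// Hypothetical helper: approximate size of the params once JSON encoded.
function encodedSize(params) {
    return JSON.stringify(params).length;
}

// Bulky: the whole data set is passed through js_call_amd().
var bulkyParams = [new Array(200).fill({id: 1, name: 'Example course'})];

// Lean: only a CSS selector is passed; the module reads the data from
// data-attributes on that element, or fetches it via ajax.
var leanParams = ['#course-list'];

console.log(encodedSize(bulkyParams) > LIMIT);
console.log(encodedSize(leanParams) > LIMIT);
```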

AMD / JS code can also be embedded on a page via Mustache templates; see https://docs.moodle.org/dev/Templates#What_if_a_template_contains_javascript.3F

== But I have a mega JS file I don't want loaded on every page? ==
Loading all JS files at once and stuffing them in the browser cache is the right choice for MOST JS files, but there are probably some exceptions. For these files, you can rename the JavaScript file to end with the suffix "-lazy.js", which indicates that the module will not be loaded by default; it will be requested the first time it is used. There is no difference in usage for lazy loaded modules: the require() call looks exactly the same, except that the module name also has the "-lazy" suffix.

== Useful links ==
* [https://assets.moodlemoot.org/sites/15/20171004085436/JavaScript-AMD-with-RequireJS-presented-by-Daniel-Roperto-Catalyst.pdf JavaScript AMD with RequireJS] presented by Daniel Roperto, Catalyst. (MoodleMOOT AU 2017)
* [[Useful_core_Javascript_modules]]

[[Category:AJAX]]
[[Category:Javascript]]

Analytics API

2018-05-17T07:27:33Z

Dmonllao: /* Indicator (core_analytics\local\indicator\base) */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with student submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits of doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis, and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size, analysing the entire site could take hours, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide whether to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.
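
The "analyse per course, then merge" idea can be pictured with a few lines of JavaScript. This is a conceptual sketch only and not Moodle's PHP implementation: analyseCourse, analyseSite and the shape of the course objects are all hypothetical.

```javascript
// Hypothetical per-course analysis: one dataset row per sample (here,
// per enrolment id) found in the course.
function analyseCourse(course) {
    return course.enrolments.map(function(enrolmentId) {
        return {sampleid: enrolmentId, courseid: course.id};
    });
}

// Analyse each course separately, then merge the partial datasets.
function analyseSite(courses) {
    var dataset = [];
    courses.forEach(function(course) {
        dataset = dataset.concat(analyseCourse(course));
    });
    return dataset;
}

var dataset = analyseSite([
    {id: 2, enrolments: [10, 11]},
    {id: 3, enrolments: [12]}
]);
console.log(dataset.length);
```

Each course is a self-contained unit of work, so a long site analysis can be split into smaller pieces.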

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target is executed once new predictions start arriving, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser. However, the indicator can easily be coded so that it can be reused with other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated in a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
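
The two approaches described above can be sketched in plain PHP. This is only an illustration that runs outside Moodle: the function names, the constants and the decision to map fewer than 3 posts to -1 are assumptions, not part of the Analytics API.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the Analytics API.
const INDICATOR_MIN = -1.0;
const INDICATOR_MAX = 1.0;

/**
 * Maps a raw post count onto [-1, 1] using the fixed thresholds from the text
 * (3 = some activity, 6 = adequate activity, 10 or more = maximum expected).
 */
function posts_to_indicator_value(int $posts): float {
    if ($posts >= 10) {
        return INDICATOR_MAX;
    }
    if ($posts >= 6) {
        return 0.33;
    }
    if ($posts >= 3) {
        return -0.33;
    }
    // Assumption: anything below the "some activity" threshold gets the minimum.
    return INDICATOR_MIN;
}

/**
 * Compares a count with the mean of all counts in the same context.
 * The result is 0 at the mean and is clamped to -1 / +1 at $maxdeviations
 * standard deviations below / above the mean.
 */
function relative_indicator_value(float $count, array $allcounts, float $maxdeviations = 2.0): float {
    $n = count($allcounts);
    if ($n === 0) {
        return 0.0;
    }
    $mean = array_sum($allcounts) / $n;
    $variance = 0.0;
    foreach ($allcounts as $value) {
        $variance += ($value - $mean) ** 2;
    }
    $stddev = sqrt($variance / $n);
    if ($stddev == 0.0) {
        // Everybody has the same count, so nobody is above or below the mean.
        return 0.0;
    }
    $zscore = ($count - $mean) / $stddev;
    return max(INDICATOR_MIN, min(INDICATOR_MAX, $zscore / $maxdeviations));
}
```

The threshold version keeps a defined standard that every student can reach, while the relative version ranks students against each other, which is the pedagogical trade-off noted above.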

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
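
As a rough illustration of equal-length ranges and their accumulative variant, the following standalone sketch splits a start/end pair into a number of parts. The function name and the array keys are assumptions made for this example; it is not the code used by the core time-splitting classes.

```php
<?php
/**
 * Hypothetical helper: splits [$start, $end] into $parts ranges of equal length.
 *
 * With $accumulative = true every range begins at $start, so each range
 * includes everything from the beginning of the course.
 */
function split_in_equal_parts(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($parts < 1 || $end <= $start) {
        // Without a fixed start and end date there is nothing to split.
        throw new InvalidArgumentException('A valid start, end and number of parts are required');
    }
    $length = intdiv($end - $start, $parts);
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        // The last range always finishes exactly at $end, absorbing rounding.
        $rangeend = ($i === $parts - 1) ? $end : $start + $length * ($i + 1);
        $ranges[] = [
            'start' => $accumulative ? $start : $start + $length * $i,
            'end' => $rangeend,
        ];
    }
    return $ranges;
}
```

Because the ranges are proportional to the course duration, a 4-week course and a 16-week course both produce four quarters, which is what allows them to share a model.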

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] is a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system; the site admin could build a new model on top of the built-in 'students at risk of dropping out' detection, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include Machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to decide which ones are going to be processed next (for example, skipping the ones processed recently). You can take core_analytics\course as an example.

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int.
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
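
The lazy loading recommended above can be illustrated with a minimal sketch. The class below is a hypothetical stand-in that only mirrors some of the method names listed in this section; it does not implement '''\core_analytics\analysable''' so that it can run outside Moodle, and the ''$loader'' callable is an assumption standing in for an expensive database read.

```php
<?php
// Hypothetical stand-in; it does not implement \core_analytics\analysable.
class lazy_course_sketch {
    /** @var int The analysable id, known without loading anything. */
    private $id;

    /** @var callable Loads the full record for an id; assumed to be expensive. */
    private $loader;

    /** @var stdClass|null The full record, filled on first use only. */
    private $record = null;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    public function get_id(): int {
        // The id never requires loading the record.
        return $this->id;
    }

    public function get_name(): string {
        return $this->load()->fullname;
    }

    /**
     * @return int|false
     */
    public function get_start() {
        $start = (int)$this->load()->startdate;
        return $start > 0 ? $start : false;
    }

    private function load(): stdClass {
        if ($this->record === null) {
            $this->record = ($this->loader)($this->id);
        }
        return $this->record;
    }
}
```

With this shape, building the full list of analysables only stores ids; the record of an analysable is read the first time one of its other getters is called, and only once.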

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''For example, if your model is only interested in quizzes with a close time, the analyser will return all quizzes and your model will exclude the ones without a close time. This approach is intended to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return data associated with the sample ids they provide. This is important for 2 reasons:
* The data they provide alongside the sample origin is used to filter out indicators that are not related to what this analyser analyses. ''For example, courses analysers provide courses and information about courses, but not information about users. An '''is user profile complete''' indicator requires the user object to be available, so a model using a courses analyser will not be able to use it.''
* The data included here is cached in PHP static vars; on one hand this reduces the amount of db queries indicators need to perform. On the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
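
The shape of the returned value, and the way the included data limits which indicators can be used, can be sketched as follows. Both functions are hypothetical and run outside Moodle; the record fields, the element names and keying both arrays by sample id are assumptions made for this example.

```php
<?php
/**
 * Hypothetical helper: builds array[0] = sample ids and array[1] = samples data
 * from a list of enrolment-like records that belong to one course.
 */
function build_samples(array $enrolments, stdClass $course): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $enrolment) {
        $sampleids[$enrolment->id] = $enrolment->id;
        // Related data is attached under the name of the element it represents.
        $samplesdata[$enrolment->id] = [
            'user_enrolments' => $enrolment,
            'course' => $course,
        ];
    }
    return [$sampleids, $samplesdata];
}

/**
 * Hypothetical check: an indicator is usable only if every element it requires
 * is among the elements provided with the samples.
 */
function can_use_indicator(?array $required, array $available): bool {
    if (empty($required)) {
        return true;
    }
    return count(array_diff($required, $available)) === 0;
}
```

In this sketch an indicator that requires ''user'' would be rejected, because the samples only carry ''user_enrolments'' and ''course'' data.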

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the Moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context to a sampleid. This is important because this sample's predictions will only be available to users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

The '''processes_user_data''' and '''join_sample_user''' methods are used by the analytics implementation of the privacy API. You only need to override them if your analyser deals with user data. They are used to export and delete user data that is stored in analytics database tables:

/**
* Whether the plugin needs user data clearing or not.
*
* @return bool
*/
public function processes_user_data();

/**
* SQL JOIN from a sample to users table.
*
* More info in [https://github.com/moodle/moodle/blob/master/analytics/classes/local/analyser/base.php core_analytics\local\analyser\base]::join_sample_user
*
* @param string $sampletablealias The alias of the table with a sampleid field that will join with this SQL string
* @return string
*/
public function join_sample_user($sampletablealias);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators. If you want your activity module to implement a [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out#Indicators community of inquiry] indicator, you can extend ''core_analytics\local\indicator\community_of_inquiry_indicator''; look for examples in Moodle core.

You can use '''required_sample_data''' to specify what your indicator needs in order to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. For example, if you need a grade_grades record, mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them anyway, because the principle analysers follow is to include as much related data as possible. Even so, do not flag related objects as required: an analyser may, for example, choose not to include the ''user'' object because it is too big and sites can have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
return null;
}

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during the $starttime - $endtime period), but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?). ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical as it runs once for each sample and for each range in the time-splitting method; some tips:
* To save database queries, analyser classes provide information about the samples that you can use for your calculations. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also override the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is not unlimited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keep the memory usage acceptably low and prevents hard-to-trace caching bugs.
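
A binary indicator following the tips above might look like the sketch below. So that the example runs outside Moodle, ''sketch_indicator_base'' is a hypothetical stand-in that only imitates the pieces used here (''retrieve()'', ''MIN_VALUE'' and ''MAX_VALUE''); ''add_sample_data()'' and ''calculate()'' are assumptions of this sketch, and the real '''\core_analytics\local\indicator\base''' does considerably more.

```php
<?php
// Hypothetical stand-in for the parts of the indicator base class used below.
abstract class sketch_indicator_base {
    const MIN_VALUE = -1;
    const MAX_VALUE = 1;

    /** @var array Data provided by the analyser, indexed by sample id. */
    protected $sampledata = [];

    public function add_sample_data(array $data): void {
        $this->sampledata = $data;
    }

    /** Returns the requested element for a sample, or false if unavailable. */
    protected function retrieve(string $elementname, int $sampleid) {
        return $this->sampledata[$sampleid][$elementname] ?? false;
    }

    abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

    public function calculate(int $sampleid, string $sampleorigin, $starttime = false, $endtime = false) {
        return $this->calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);
    }
}

// Hypothetical yes/no indicator: does the user have a picture and a description?
class user_profile_set_sketch extends sketch_indicator_base {
    public static function required_sample_data() {
        // Only the element that is really needed.
        return ['user'];
    }

    protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime) {
        // Use the data the analyser already provided instead of querying again.
        $user = $this->retrieve('user', $sampleid);
        if ($user === false) {
            return null; // The indicator can not be calculated for this sample.
        }
        $complete = !empty($user->picture) && !empty($user->description);
        return $complete ? self::MAX_VALUE : self::MIN_VALUE;
    }
}
```

This indicator ignores $starttime and $endtime because a profile is not tied to a time range; an activity-based indicator would use them to limit the logs it reads.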

==== Target (core_analytics\local\target\base) ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets can be reused between models, although this is not recommended; focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The main valid use case for reusing a target in production models is to apply different time-splitting methods to it although, again, the proper way to solve this is to use a single time-splitting method specific to your needs.

The first thing a target must define is the analyser class that it will use. The analyser class is specified in '''get_analyser_class'''.

/**
* Returns the analyser class that should be used along with this target.
*
* @return string The full class name as a string
*/
abstract public function get_analyser_class();

'''is_valid_analysable''' and '''is_valid_sample''' are used to discard elements that are not valid for your target.

/**
* Allows the target to verify that the analysable is a good candidate.
*
* This method can be used as a quick way to discard invalid analysables.
* e.g. Imagine that your analysable don't have students and you need them.
*
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return true|string
*/
public function is_valid_analysable(\core_analytics\analysable $analysable, $fortraining = true);

/**
* Is this sample from the $analysable valid?
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return bool
*/
public function is_valid_sample($sampleid, \core_analytics\analysable $analysable, $fortraining = true);

'''calculate_sample''' is the method that calculates the target value.

/**
* Calculates this target for the provided samples.
*
* In case there are no values to return or the provided sample is not applicable just return null.
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param int|false $starttime Limit calculations to start time
* @param int|false $endtime Limit calculations to end time
* @return float|null
*/
protected function calculate_sample($sampleid, \core_analytics\analysable $analysable, $starttime = false, $endtime = false);
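
The decisions a target takes in these methods can be sketched with plain functions for a hypothetical "student did not complete the course" target. The function names, the record fields and the use of 1 for the event and 0 for its absence are assumptions made for this example; a real target would implement them as methods of a class extending '''\core_analytics\local\target\binary'''.

```php
<?php
/**
 * Hypothetical sketch of an is_valid_analysable() decision for a course.
 *
 * @return true|string True if valid, otherwise the reason why it is not.
 */
function sketch_is_valid_analysable(stdClass $course, int $now, bool $fortraining = true) {
    if (empty($course->startdate) || empty($course->enddate)) {
        return 'The course needs a start date and an end date';
    }
    if ($fortraining && $course->enddate > $now) {
        // The outcome is only known once the course has finished.
        return 'The course has not finished yet';
    }
    if (!$fortraining && ($course->startdate > $now || $course->enddate < $now)) {
        // Predictions are only useful while there is time to intervene.
        return 'The course is not ongoing';
    }
    return true;
}

/**
 * Hypothetical sketch of a calculate_sample() decision.
 *
 * @return float|null 1.0 if the student did not complete, 0.0 if they did,
 *                    null if there is no completion information.
 */
function sketch_calculate_sample(?stdClass $completion) {
    if ($completion === null) {
        return null;
    }
    return empty($completion->timecompleted) ? 1.0 : 0.0;
}
```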

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

Time-splitting methods define when the analytics API will train the predictions processor and when it will generate predictions. As explained above in [[Analytics_API#Time_splitting_methods]], they define time ranges based on the start and end timestamps of analysable elements.

The base class is '''\core_analytics\local\time_splitting\base'''; if what you are after is to split the analysable duration in equal parts or in cumulative parts you can extend '''\core_analytics\local\time_splitting\equal_parts''' or '''\core_analytics\local\time_splitting\accumulative_parts''' instead.

'''define_ranges''' is the main method to implement and its values mostly depend on the current analysable element (available in '''$this->analysable'''). It should return an array of time ranges, each containing 3 attributes: a start time ('start') and an end time ('end') that will be passed to indicators so they can limit the amount of activity logs they read, and 'time', whose value determines when the range will be executed.

/**
* Define the time splitting methods ranges.
*
* 'time' value defines when predictions are executed, their values will be compared with
* the current time in ready_to_predict
*
* @return array('start' => time(), 'end' => time(), 'time' => time())
*/
protected function define_ranges();
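
To illustrate the role of the 'time' attribute, the sketch below defines a single range that is executed one week before the end of the analysable and then checks which ranges are already due. Both functions are hypothetical and run outside Moodle; in a real time-splitting method the ranges would be returned from ''define_ranges()'' and the comparison with the current time is done by the API.

```php
<?php
/**
 * Hypothetical sketch: one range covering everything up to a week before the end,
 * to be executed at that same moment.
 */
function week_before_end_range(int $start, int $end): array {
    $weeksecs = 7 * 24 * 60 * 60;
    $time = $end - $weeksecs;
    if ($time <= $start) {
        // The analysable is too short for this method, so no ranges are defined.
        return [];
    }
    return [
        ['start' => $start, 'end' => $time, 'time' => $time],
    ];
}

/**
 * Hypothetical sketch: returns the ranges whose 'time' has already been reached.
 */
function ranges_ready(array $ranges, int $now): array {
    return array_values(array_filter($ranges, function (array $range) use ($now) {
        return $range['time'] <= $now;
    }));
}
```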

A name and description should also be specified:

/**
* Returns a lang_string object representing the name for the time splitting method.
*
* Used as column identificator.
*
* If there is a corresponding '_help' string this will be shown as well.
*
* @return \lang_string
*/
public static function get_name() : \lang_string;

==== Calculable (core_analytics\calculable) ====

This interface is described last because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''', but you can still code targets or indicators directly from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

New models can be created and implemented in PHP, and can be packaged as a Moodle local plugin for distribution. Sample model components and models are provided at https://github.com/dmonllao/moodle-local_testanalytics.

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success," this definition should probably have some basis in pedagogy. (For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework: teaching presence, social presence, and cognitive presence.) Start by being clear about how the target will be defined. It must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own, and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable'''), and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour you can overwrite in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to commit to a single time splitting method at this stage; it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

As specified in https://docs.moodle.org/dev/Analytics_API#Target_.28core_analytics.5Clocal.5Ctarget.5Cbase.29.

=== Create the model ===

To add a new model to the system, it must be defined in a PHP file. Normally this is done as part of install.php or upgrade.php for a plugin that contains the new model and components. However, it is also possible to execute the necessary commands in a standalone PHP file that references the Moodle config.php.

To create the model, specify at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use ''Site administration > Analytics > Analytics models'' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model
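
The ''is_valid_sample'' conditions of the spam users example above could be sketched as a plain function. The function name and its parameters are assumptions made for this illustration; in a real target the user creation time would come from the sample data provided by the analyser.

```php
<?php
/**
 * Hypothetical sketch of the validity rule for the "spam users" example.
 *
 * Training needs the full 2 days of activity, while a first prediction can
 * already be made 4 hours after the user was created.
 */
function spam_sample_is_valid(int $timecreated, int $now, bool $fortraining): bool {
    $age = $now - $timecreated;
    if ($fortraining) {
        return $age >= 2 * 24 * 60 * 60; // 2 days.
    }
    return $age >= 4 * 60 * 60; // 4 hours.
}
```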

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students' quiz attempts, the student's record in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has passed the quiz or has attempted it at least X times (a student who has not passed after X attempts counts as failed)
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

Moodle sites with multiple frontend nodes can use the 'analytics/outputdir' setting to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal state, such as variable weights) and reuse them later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2018-05-17T06:57:44Z

Dmonllao: /* Analyser (core_analytics\local\analyser\base) */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated from indicator values. They are based on the assumptions defined in the target, but they should still be built on indicators so that all these indicators can be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. e.g. A sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the size of the site, analysing all of it could take hours, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide whether to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but the indicator can easily be coded so that it can be reused with other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be expressed as a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
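
The two normalisation approaches described above could be sketched in plain PHP as follows. This is only an illustration and not Moodle core code: the function names, the expected maximum and the number of standard deviations used as limits are all assumptions.

```php
<?php
// Hypothetical helpers, not part of the Moodle Analytics API.

/**
 * Maps a raw event count to [-1, 1] using an expected maximum.
 *
 * 0 events map to -1 and $expectedmax (or more) events map to 1.
 */
function scale_count_to_range(int $count, int $expectedmax): float {
    if ($expectedmax <= 0) {
        throw new InvalidArgumentException('expectedmax must be positive');
    }
    // Clamp the count to [0, $expectedmax] and express it as a 0..1 ratio.
    $ratio = min(max($count, 0), $expectedmax) / $expectedmax;
    return -1.0 + 2.0 * $ratio;
}

/**
 * Maps a raw count to [-1, 1] by comparing it with all counts in the same context.
 *
 * A count equal to the mean maps to 0; a count $maxdeviations standard
 * deviations (or more) away from the mean maps to -1 or 1.
 */
function scale_count_by_deviation(float $count, array $allcounts, float $maxdeviations = 2.0): float {
    $n = count($allcounts);
    if ($n === 0) {
        return 0.0;
    }
    $mean = array_sum($allcounts) / $n;
    $variance = 0.0;
    foreach ($allcounts as $value) {
        $variance += ($value - $mean) ** 2;
    }
    $stddev = sqrt($variance / $n);
    if ($stddev == 0.0) {
        // Everybody has the same count, so nobody deviates from the mean.
        return 0.0;
    }
    $scaled = ($count - $mean) / ($stddev * $maxdeviations);
    return max(-1.0, min(1.0, $scaled));
}
```

With an expected maximum of 10 posts, 5 posts would be reported as 0 and anything above 10 posts as 1. The second function shares the pedagogical caveat noted above, because it ranks each student against the others.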

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components. Core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include the machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones are going to be processed next (for example, skipping the ones that were processed recently). You can take core_analytics\course as an example.

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
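
The lazy loading recommended above could look like the following sketch. It is a standalone class written for illustration, not \core_analytics\course and not an implementation of the real interface: the class name, the ''$loader'' callback and the ''$loads'' counter are hypothetical. The idea is that building the object is cheap and the expensive data is only fetched the first time a method needs it.

```php
<?php
// Hypothetical sketch: a standalone class, not \core_analytics\course.

class lazy_analysable {
    /** @var int The analysable id, known without loading anything. */
    private $id;

    /** @var callable Receives the id and returns an array with 'name', 'start' and 'end' keys. */
    private $loader;

    /** @var array|null The loaded data, null until it is first needed. */
    private $data = null;

    /** @var int How many times the loader ran (only here to illustrate the behaviour). */
    public $loads = 0;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    public function get_id(): int {
        // No loading required: listing all analysables stays cheap.
        return $this->id;
    }

    public function get_name(): string {
        return $this->load()['name'];
    }

    public function get_start() {
        return $this->load()['start'] ?? false;
    }

    public function get_end() {
        return $this->load()['end'] ?? false;
    }

    /**
     * Fetches the data on first use and reuses it afterwards.
     */
    private function load(): array {
        if ($this->data === null) {
            $this->data = ($this->loader)($this->id);
            $this->loads++;
        }
        return $this->data;
    }
}
```

In a real analysable the loader would typically be a database query; here it is injected as a callback so the sketch does not depend on Moodle.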

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''e.g. if your model is only interested in quizzes with a close time, the analyser will still return all quizzes and your model will exclude the ones without a close time. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return data associated with the sample ids they provide. This is important for 2 reasons:
* The data they provide alongside the sample origin is used to filter out indicators that are not related to what this analyser analyses. ''e.g. course analysers provide courses and information about courses, but not information about users; an '''is user profile complete''' indicator requires the user object to be available, so a model using a courses analyser will not be able to use the '''is user profile complete''' indicator.''
* The data included here is cached in PHP static vars; on one hand this reduces the amount of db queries indicators need to perform. On the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
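
As a rough illustration of the return value documented above (sample ids in position 0, samples data in position 1), the following standalone helper builds that structure from a list of rows. The helper name, the shape of the rows and the data key used are assumptions made for this sketch; a real analyser would fetch the rows from the database and would usually attach more related objects to each sample.

```php
<?php
// Hypothetical helper; $enrolments stands in for rows fetched from the database.

/**
 * Builds a two-element array shaped like the get_all_samples() return value.
 *
 * @param array[] $enrolments Rows with 'id', 'userid' and 'courseid' keys (assumed shape)
 * @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
 */
function build_samples(array $enrolments): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $row) {
        $id = (int)$row['id'];
        $sampleids[$id] = $id;
        // Assumption: related data is stored under a name that indicators can
        // request later; only the enrolment row itself is attached here.
        $samplesdata[$id] = ['user_enrolments' => $row];
    }
    return [$sampleids, $samplesdata];
}
```

Every extra object stored per sample saves indicators a database query but increases memory usage, which is the balance mentioned in the list above.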

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context to a sampleid. This is important because this sample's predictions will only be available to users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

The '''processes_user_data''' and '''join_sample_user''' methods are used by the analytics implementation of the privacy API. You only need to override them if your analyser deals with user data. They are used to export and delete user data that is stored in analytics database tables:

/**
* Whether the plugin needs user data clearing or not.
*
* @return bool
*/
public function processes_user_data();

/**
* SQL JOIN from a sample to users table.
*
* More info in [https://github.com/moodle/moodle/blob/master/analytics/classes/local/analyser/base.php core_analytics\local\analyser\base]::join_sample_user
*
* @param string $sampletablealias The alias of the table with a sampleid field that will join with this SQL string
* @return string
*/
public function join_sample_user($sampletablealias);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators.

You can use '''required_sample_data''' to specify what your indicator needs in order to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. e.g. if you need a grade_grades record, mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them anyway, because analysers follow the principle of including as much related data as possible. Still, do not flag related objects as required: an analyser may, for example, choose not to include the ''user'' object because it is too big and sites could run into memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
return null;
}
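
The effect of '''required_sample_data''' on which indicators a model can use can be pictured with the small check below. It is a simplified sketch and not the code the Analytics API actually runs; the function name is hypothetical and the provided data is assumed to be a flat list of names.

```php
<?php
// Hypothetical helper, not part of the Moodle Analytics API.

/**
 * Whether an analyser provides everything an indicator requires.
 *
 * @param string[]|null $required What required_sample_data() returned (null = nothing required)
 * @param string[] $provided Names of the data elements the analyser provides
 * @return bool
 */
function indicator_is_usable(?array $required, array $provided): bool {
    if ($required === null) {
        // The default implementation requires nothing, so any analyser works.
        return true;
    }
    // Every required element must be among the provided ones.
    return count(array_diff($required, $provided)) === 0;
}
```

For example, an indicator requiring ''user'' would not be usable with an analyser that only provides ''course'', while an indicator requiring only ''grade_grades'' would be usable with any analyser that provides it, whatever else is included.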

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during the $starttime - $endtime period), but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?). ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical as it runs once for each sample and for each range in the time-splitting method; some tips:
* To avoid performance issues caused by repeated database queries, analyser classes provide information about the samples that you can use in your calculations. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also override the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is not unlimited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keep the memory usage acceptably low and prevents hard-to-trace caching bugs.

==== Target (core_analytics\local\target\base) ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets can be reused between models, but this is not recommended; focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The only valid use case I can think of for models in production is using the same target with different time-splitting methods although, again, the proper way to solve this is to use a single time-splitting method specific to your needs.

The first thing a target must define is the analyser class that it will use. The analyser class is specified in '''get_analyser_class'''.

/**
* Returns the analyser class that should be used along with this target.
*
* @return string The full class name as a string
*/
abstract public function get_analyser_class();

'''is_valid_analysable''' and '''is_valid_sample''' are used to discard elements that are not valid for your target.

/**
* Allows the target to verify that the analysable is a good candidate.
*
* This method can be used as a quick way to discard invalid analysables.
* e.g. Imagine that your analysable doesn't have students and you need them.
*
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return true|string
*/
public function is_valid_analysable(\core_analytics\analysable $analysable, $fortraining = true);

/**
* Is this sample from the $analysable valid?
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return bool
*/
public function is_valid_sample($sampleid, \core_analytics\analysable $analysable, $fortraining = true);
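
Note that '''is_valid_analysable''' returns either ''true'' or a string explaining why the analysable was discarded. The standalone function below sketches that contract for a course-like analysable. The function name, the array keys and the messages are hypothetical, and a real target would receive a \core_analytics\analysable object instead of an array.

```php
<?php
// Hypothetical sketch of the true|string contract, using a plain array as the analysable.

/**
 * Checks whether a course is a good candidate for training or prediction.
 *
 * @param array $course Assumed keys: 'students' (array of user ids) and 'end' (timestamp)
 * @param int $now Current timestamp
 * @param bool $fortraining True when collecting training data, false when predicting
 * @return true|string True if valid, otherwise the reason why it is not
 */
function check_course_validity(array $course, int $now, bool $fortraining = true) {
    if (empty($course['students'])) {
        return 'The course has no students';
    }
    if ($fortraining && $course['end'] > $now) {
        // Training needs known outcomes, so the course has to be finished.
        return 'The course is not finished yet';
    }
    if (!$fortraining && $course['end'] <= $now) {
        // Predictions about a finished course are no longer useful.
        return 'The course is already finished';
    }
    return true;
}
```

Because any non-empty string is truthy in PHP, callers of this kind of method should compare the result with ''=== true'' instead of relying on a loose check.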

'''calculate_sample''' is the method that calculates the target value.

/**
* Calculates this target for the provided samples.
*
* In case there are no values to return or the provided sample is not applicable just return null.
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param int|false $starttime Limit calculations to start time
* @param int|false $endtime Limit calculations to end time
* @return float|null
*/
protected function calculate_sample($sampleid, \core_analytics\analysable $analysable, $starttime = false, $endtime = false);

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

Time-splitting methods are useful to define when the analytics API will train the predictions processor and when it will generate predictions. As explained above in [[Analytics_API#Time_splitting_methods]], they define time ranges based on analysable elements start and end timestamps.

The base class is '''\core_analytics\local\time_splitting\base'''; if what you are after is to split the analysable duration in equal parts or in cumulative parts you can extend '''\core_analytics\local\time_splitting\equal_parts''' or '''\core_analytics\local\time_splitting\accumulative_parts''' instead.

'''define_ranges''' is the main method to implement and its values mostly depend on the current analysable element (available in '''$this->analysable'''). It should return an array of time ranges, each of them with 3 attributes: a start time ('start') and an end time ('end') that will be passed to indicators so they can limit the amount of activity logs they read, and 'time', whose value determines when the range will be executed.

/**
* Define the time splitting methods ranges.
*
* 'time' value defines when predictions are executed, their values will be compared with
* the current time in ready_to_predict
*
* @return array('start' => time(), 'end' => time(), 'time' => time())
*/
protected function define_ranges();
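
As a minimal sketch of the array structure described above, the following standalone function splits a period into equal parts. It is not the core ''equal_parts'' implementation: the function name ''example_equal_ranges'' is hypothetical, and setting 'time' to the end of each range is an assumption made for illustration only.

<code php>
<?php
// Hypothetical helper, not part of Moodle: splits [$start, $end] into $parts equal ranges.
function example_equal_ranges(int $start, int $end, int $parts): array {
    if ($parts < 1 || $end <= $start) {
        return [];
    }

    $length = intdiv($end - $start, $parts);
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $start + $i * $length;
        // The last range absorbs any rounding remainder so that it ends exactly at $end.
        $rangeend = ($i === $parts - 1) ? $end : $rangestart + $length;
        $ranges[] = [
            'start' => $rangestart,
            'end' => $rangeend,
            // Assumption: each range is executed as soon as it has finished.
            'time' => $rangeend,
        ];
    }

    return $ranges;
}

// Split a period of 4000 seconds into quarters.
$ranges = example_equal_ranges(1000, 5000, 4);
</code>

In a real time-splitting method the start and end timestamps would come from '''$this->analysable''' rather than from function arguments.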

A name and description should also be specified:

/**
* Returns a lang_string object representing the name for the time splitting method.
*
* Used as column identificator.
*
* If there is a corresponding '_help' string this will be shown as well.
*
* @return \lang_string
*/
public static function get_name() : \lang_string;

==== Calculable (core_analytics\calculable) ====

Leaving this interface for the end because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

New models can be created and implemented in PHP, and can be packaged as a Moodle local plugin for distribution. Sample model components and models are provided at https://github.com/dmonllao/moodle-local_testanalytics.

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success", this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework (teaching presence, social presence, and cognitive presence). Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of each analysable through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements course by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour you can overwrite in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same sample, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to settle on a single time splitting method at this stage, as it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

As specified in https://docs.moodle.org/dev/Analytics_API#Target_.28core_analytics.5Clocal.5Ctarget.5Cbase.29.

=== Create the model ===

To add a new model to the system, it must be defined in a PHP file. Normally this is done as part of install.php or upgrade.php for a plugin that contains the new model and components. However, it is also possible to execute the necessary commands in a standalone PHP file that references the Moodle config.php.

To create the model, specify at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may want to evaluate how good a model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use ''Site administration > Analytics > Analytics models'' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site; it would also work using the course as the analysable)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than X% of the course students have attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has passed the quiz, or has attempted it at least X times (a student who has not passed after X attempts counts as failed)
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal state, such as variable weights) so that they can be used later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Privacy API

2018-04-24T08:34:00Z

Dmonllao: /* Delete for a context */

The [https://en.wikipedia.org/wiki/General_Data_Protection_Regulation General Data Protection Regulation] (GDPR) is an EU regulation that aims to give users more control over their data and how it is processed. It will come into effect on 25 May 2018 and covers any citizen or permanent resident of the European Union. The regulation will also be respected by a number of countries outside of the European Union.

To help institutions become compliant with this new regulation we are adding functionality to Moodle. This includes a number of components, among them support for a user’s right to:

* request information on the types of personal data held, the instances of that data, and the deletion policy for each;
* access all of their data; and
* be forgotten.

The compliance requirements also extend to installed plugins (including third party plugins). These need to also be able to report what information they store or process regarding users, and have the ability to provide and delete data for a user request.

This document describes the proposed API changes required for plugins which will allow a Moodle installation to become GDPR compliant.

Target Audience: The intended audience for this document is Moodle plugin developers, who are aiming to ensure their plugins are updated to comply with GDPR requirements coming into effect in the EU in May, 2018.

==Personal data in Moodle==

From the GDPR Spec, Article 4:

''‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;''

In Moodle, we need to consider two main types of personal data; information entered by the user and information stored about the user. The key difference being that information stored about the user will have come from a source other than the user themselves. Both types of data can be used to form a profile of the individual.

The most obvious clue to finding personal data entered by the user is the presence of a userid on a database field. Any data on the record (or linked records) pertaining to that user may be deemed personal data for that user, including things like timestamps and record identification numbers. Additionally, any free text field which allows the user to enter information must also be considered to be the personal data of that user.

Data stored about the user includes things like ratings and comments made on a student submission. These may have been made by an assessor or teacher, but are considered the personal data of the student, as they are considered a reflection of the user’s competency in the subject matter and can be used to form a profile of that individual.

The sections that follow outline what you need to do as a plugin developer to ensure any personal data is advertised and can be accessed and deleted according to the GDPR requirements.

==Background==

===Architecture overview===

[[File:MoodlePrivacyMetadataUML.png|thumb|UML diagram of the metadata part of the privacy subsystem]]
[[File:MoodlePrivacyRequestUML.png|thumb|UML diagram of the request providers part of the privacy subsystem]]

A new system for Privacy has been created within Moodle. This is broken down into several main parts and forms the ''core_privacy'' subsystem:

* Some metadata providers - a set of PHP interfaces to be implemented by components for that component to describe the kind of data that it stores, and the purpose for its storage;
* Some request providers - a set of PHP interfaces to be implemented by components to allow that component to act upon user requests such as the Right to be Forgotten, and a Subject Access Request; and
* A manager - a concrete class used to bridge components which implement the providers with tools which request their data.

All plugins will implement one metadata provider, and zero, one or two request providers.

The fetching of data is broken into two separate steps:

# Detecting in which Moodle contexts the user has any data; and
# Exporting all data from each of those contexts.

This has been broken into two steps to later allow administrators to exclude certain contexts from an export - e.g. for courses currently in progress.

A third component will later be added to facilitate the deletion of data within these contexts which will help to satisfy the Right to be Forgotten. This will also use the first step.
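
The two-step flow can be pictured with the following toy sketch. It deliberately does not use the Moodle API: plain arrays stand in for contexts and plugin tables, and every function and variable name here is hypothetical. It only illustrates why splitting the work allows some contexts to be excluded between the two steps.

<code php>
<?php
// Toy illustration only: plain arrays stand in for Moodle contexts and plugin tables.
// None of these function names exist in Moodle.

/** Step 1: find the IDs of the contexts in which the user has any data. */
function example_get_contexts_for_user(array $records, int $userid): array {
    $contextids = [];
    foreach ($records as $record) {
        if ($record['userid'] === $userid) {
            $contextids[$record['contextid']] = true;
        }
    }
    return array_keys($contextids);
}

/** Step 2: export the user's data, but only from the approved contexts. */
function example_export_user_data(array $records, int $userid, array $approvedcontextids): array {
    $export = [];
    foreach ($records as $record) {
        if ($record['userid'] === $userid && in_array($record['contextid'], $approvedcontextids, true)) {
            $export[$record['contextid']][] = $record['content'];
        }
    }
    return $export;
}

$records = [
    ['contextid' => 10, 'userid' => 2, 'content' => 'Post in a finished course'],
    ['contextid' => 20, 'userid' => 2, 'content' => 'Post in a course still in progress'],
    ['contextid' => 20, 'userid' => 3, 'content' => 'Post by another user'],
];

$contextids = example_get_contexts_for_user($records, 2);
// An administrator excludes context 20, e.g. because the course is still in progress.
$approved = array_values(array_diff($contextids, [20]));
$export = example_export_user_data($records, 2, $approved);
</code>

In the real subsystem these roles are played by the request provider functions described later in this document.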

===Implementing a provider===

All plugins will need to create a concrete class which implements the relevant metadata and request providers. The exact providers you need to implement will depend on what data you store, and the type of plugin. This is covered in more detail in the following sections of the document.

In order to do so:

# You must create a class called ''provider'' within the namespace ''\your_pluginname\privacy''.
# This class must be created at ''path/to/your/plugin/classes/privacy/provider.php''.
# You must have your class implement the relevant metadata and request interfaces.

==Plugins which do not store personal data==

Many Moodle plugins do not store any personal data. This is usually the case for plugins which just add functionality, or which display the data already stored elsewhere in Moodle.

Some examples of plugin types which might fit these criteria include themes, blocks, filters, editor plugins, etc.

Plugins which cause data to be stored elsewhere in Moodle (e.g. via a subsystem call) are considered to store data.

One example of a plugin which does not store any data would be the Calendar month block which just displays a view of the user’s calendar. It does not store any data itself.

An example of a plugin which must not use the null provider is the Comments block. The comments block is responsible for data subsequently being stored within Moodle. Although the block doesn’t store anything itself, it interacts with the comments subsystem and is the only component which knows how that data maps to a user.

===Implementation requirements===

In order to let Moodle know that you have audited your plugin, and that you do not store any personal user data, you must implement the ''\core_privacy\local\metadata\null_provider'' interface in your plugin’s provider.

These null providers can only be implemented where a plugin has:

* no external links (e.g. sends data to an external service like an LTI provider, repository plugin which you can search on)
* no database tables which store user data (including IP addresses)
* no user preferences

The ''null_provider'' requires you to define one function ''get_reason()'' which returns the language string identifier within your component.

====Example====

''blocks/calendar_month/classes/privacy/provider.php''

<code php>
<?php
// …

namespace block_calendar_month\privacy;

class provider implements
// This plugin does not store any personal user data.
\core_privacy\local\metadata\null_provider {

/**
* Get the language string identifier with the component's language
* file to explain why this plugin stores no data.
*
* @return string
*/
public static function get_reason() : string {
return 'privacy:metadata';
}
}
</code>

''blocks/calendar_month/lang/en/block_calendar_month.php''

<code php>
<?php
// …

...
$string['privacy:metadata'] = 'The Calendar block only displays existing calendar data.';
...
</code>

That’s it. Congratulations, your plugin now implements the Privacy API.

==Plugins which store personal data==

Many Moodle plugins do store some form of personal data.

In some cases this will be stored within database tables in your plugin, and in other cases this will be in one of Moodle’s core subsystems - for example your plugin may store files, ratings, comments, or tags.

Plugins which do store data will need to:

* Describe the type of data that they store;
* Provide a way to export that data; and
* Provide a way to delete that data.

Data is described via a ''metadata'' provider, and it is both exported and deleted via an implementation of a ''request'' provider.

These are both explained in the sections below.

===Describing the type of data you store===

In order to describe the type of data that you store, you must implement the ''\core_privacy\local\metadata\provider'' interface.

This interface requires that you define one function: ''get_metadata''.

There are several types of item to describe the data that you store. These are for:

* Items in the Moodle database;
* Items stored by you in a Moodle subsystem - for example files, and ratings; and
* User preferences stored site-wide within Moodle for your plugin

Note: All fields should include a description from a language string within your plugin.

====Example====

''mod/forum/classes/privacy/provider.php''

<code php>
<?php
// …

namespace mod_forum\privacy;
use core_privacy\local\metadata\collection;

class provider implements
// This plugin does store personal user data.
\core_privacy\local\metadata\provider {

public static function get_metadata(collection $collection) : collection {

// Here you will add more items into the collection.

return $collection;
}
}
</code>

====Indicating that you store content in a Moodle subsystem====

Many plugins will use one of the core Moodle subsystems to store data.

As a plugin developer we do not expect you to describe those subsystems in detail, but we do need to know that you use them and to know what you use them for.

You can indicate this by calling the ''add_subsystem_link()'' method on the ''collection''.

=====Example=====

''mod/forum/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_subsystem_link(
'core_files',
[],
'privacy:metadata:core_files'
);

return $collection;
}
</code>

''mod/forum/lang/en/forum.php''

<code php>
<?php

$string['privacy:metadata:core_files'] = 'The forum stores files which have been uploaded by the user to form part of a forum post.';
</code>

====Describing data stored in database tables====

Most Moodle plugins will store some form of user data in their own database tables.

As a plugin developer you will need to describe each database table, and each field which includes user data.

=====Example=====

''mod/forum/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_database_table(
'forum_discussion_subs',
[
'userid' => 'privacy:metadata:forum_discussion_subs:userid',
'discussionid' => 'privacy:metadata:forum_discussion_subs:discussionid',
'preference' => 'privacy:metadata:forum_discussion_subs:preference',

],
'privacy:metadata:forum_discussion_subs'
);

return $collection;
}
</code>

''mod/forum/lang/en/forum.php''

<code php>
<?php

$string['privacy:metadata:forum_discussion_subs'] = 'Information about the subscriptions to individual forum discussions. This includes when a user has chosen to subscribe to a discussion, or to unsubscribe from one where they would otherwise be subscribed.';
$string['privacy:metadata:forum_discussion_subs:userid'] = 'The ID of the user with this subscription preference.';
$string['privacy:metadata:forum_discussion_subs:discussionid'] = 'The ID of the discussion that was subscribed to.';
$string['privacy:metadata:forum_discussion_subs:preference'] = 'The start time of the subscription.';
</code>

====Indicating that you store site-wide user preferences====

Many plugins will include one or more user preferences. Unfortunately this is one of Moodle’s older components and many of the values stored are not pure user preferences. Each plugin should be aware of how it handles its own preferences and is best placed to determine whether they are site-wide preferences, or per-instance preferences.

Whilst most of these will have a fixed name (e.g. ''filepicker_recentrepository''), some will include a variable of some kind (e.g. ''tool_usertours_tour_completion_time_2''). Only the general name needs to be indicated rather than one copy for each preference.

Also, these should only be ''site-wide'' user preferences which do not belong to a specific Moodle context.

In the above examples:

* Preference ''filepicker_recentrepository'' belongs to the file subsystem, and is a site-wide preference affecting the user anywhere that they view the filepicker.
* Preference ''tool_usertours_tour_completion_time_2'' belongs to user tours. User tours are a site-wide feature which can affect many parts of Moodle and cross multiple contexts.

In some cases a value may be stored in the preferences table but is known to belong to a specific context within Moodle. In these cases they should be stored as metadata against that context rather than as a site-wide user preference.

You can indicate this by calling the ''add_user_preference()'' method on the ''collection''.

Any plugin providing user preferences must also implement the ''\core_privacy\local\request\preference_provider''.

=====Example=====

''admin/tool/usertours/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_user_preference('tool_usertours_tour_completion_time',
'privacy:metadata:preference:tool_usertours_tour_completion_time');

return $collection;
}
</code>

''admin/tool/usertours/lang/en/tool_usertours.php''

<code php>
<?php

$string['privacy:metadata:preference:tool_usertours_tour_completion_time'] = 'The time that a specific user tour was last completed by a user.';
</code>

====Indicating that you export data to an external location====

Many plugins will interact with external systems - for example cloud-based services. Often this external location is configurable within the plugin either at the site or the instance level.

As a plugin developer you will need to describe each ''type'' of target destination, alongside a list of each exported field which includes user data.
The ''actual'' destination does not need to be described as this can change based on configuration.

You can indicate this by calling the ''add_external_location_link()'' method on the collection.

=====Example=====

''mod/lti/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_external_location_link('lti_client', [
'userid' => 'privacy:metadata:lti_client:userid',
'fullname' => 'privacy:metadata:lti_client:fullname',
], 'privacy:metadata:lti_client');

return $collection;
}
</code>

''mod/lti/lang/en/lti.php''

<code php>
<?php

$string['privacy:metadata:lti_client'] = 'In order to integrate with a remote LTI service, user data needs to be exchanged with that service.';
$string['privacy:metadata:lti_client:userid'] = 'The userid is sent from Moodle to allow you to access your data on the remote system.';
$string['privacy:metadata:lti_client:fullname'] = 'Your full name is sent to the remote system to allow a better user experience.';
</code>

===Providing a way to export user data===

In order to export the user data that you store, you must implement the relevant request provider.

We have named these request providers because they are called in response to a specific request from a user to access their information.

There are several different types of request provider, and you may need to implement several of these, depending on the type and nature of your plugin.

Broadly speaking plugins will fit into one of the following categories:

* Plugins which are a subplugin of another plugin. Examples include ''assignsubmission'', ''atto'', and ''datafield'';
* Plugins which are typically called by a Moodle subsystem. Examples include ''qtype'', and ''profilefield'';
* All other plugins which store data.

Most plugins will fit into this final category, whilst other plugins may fall into several categories.
Plugins which ''define'' a subplugin will also be responsible for collecting this data from their subplugins.

A final category exists - plugins which store user preferences. In some cases this may be the ''only'' provider implemented.

====Standard plugins which store data====

A majority of Moodle plugins will fit into this category and will be required to implement the ''\core_privacy\local\request\plugin\provider'' interface. This interface requires that you define two functions:

* ''get_contexts_for_userid'' - to explain where data is held within Moodle for your plugin; and
* ''export_user_data'' - to export a user’s personal data from your plugin.

These APIs make use of the Moodle ''context'' system to hierarchically store this data.

====Retrieving the list of contexts====

Contexts are retrieved using the ''get_contexts_for_userid'' function which takes the ID of the user being fetched, and returns a list of contexts in which the user has any data.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {}
</code>

The function returns a ''\core_privacy\local\request\contextlist'' which is used to keep a set of contexts together in a fixed fashion.

Because a Subject Access Request covers ''every'' piece of data that is held for a user within Moodle, efficiency and performance are highly important. As a result, contexts are added to the ''contextlist'' by defining one or more SQL queries which return just the contextid. Multiple SQL queries can be added as required.

Many plugins will interact with specific subsystems and store data within them.
These subsystems will also provide a way in which to link the data that you have stored with your own database tables.
At present these are still a work in progress and only the ''core_rating'' subsystem includes this.

=====Basic example=====

The following example simply fetches the contextid for all forums in which a user has started a discussion (note: this is an incomplete example):

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {
$contextlist = new \core_privacy\local\request\contextlist();

$sql = "SELECT c.id
FROM {context} c
INNER JOIN {course_modules} cm ON cm.id = c.instanceid AND c.contextlevel = :contextlevel
INNER JOIN {modules} m ON m.id = cm.module AND m.name = :modname
INNER JOIN {forum} f ON f.id = cm.instance
LEFT JOIN {forum_discussions} d ON d.forum = f.id
WHERE (
d.userid = :discussionuserid
)
";

$params = [
'modname' => 'forum',
'contextlevel' => CONTEXT_MODULE,
'discussionuserid' => $userid,
];

$contextlist->add_from_sql($sql, $params);

return $contextlist;
}
</code>

=====More complete example=====

The following example includes a link to core_rating.
It will find any forum, forum discussion, or forum post where the user has any data, including:

* Per-forum digest preferences;
* Per-forum subscription preferences;
* Per-forum read tracking preferences;
* Per-discussion subscription preferences;
* Per-post read data (if a user has read a post or not); and
* Per-post rating data.

In the case of the rating data, this will include any post where the user has rated the post of another user.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {
$ratingsql = \core_rating\privacy\provider::get_sql_join('rat', 'mod_forum', 'post', 'p.id', $userid);
// Fetch all forum discussions, and forum posts.
$sql = "SELECT c.id
FROM {context} c
INNER JOIN {course_modules} cm ON cm.id = c.instanceid AND c.contextlevel = :contextlevel
INNER JOIN {modules} m ON m.id = cm.module AND m.name = :modname
INNER JOIN {forum} f ON f.id = cm.instance
LEFT JOIN {forum_discussions} d ON d.forum = f.id
LEFT JOIN {forum_posts} p ON p.discussion = d.id
LEFT JOIN {forum_digests} dig ON dig.forum = f.id
LEFT JOIN {forum_subscriptions} sub ON sub.forum = f.id
LEFT JOIN {forum_track_prefs} pref ON pref.forumid = f.id
LEFT JOIN {forum_read} hasread ON hasread.forumid = f.id
LEFT JOIN {forum_discussion_subs} dsub ON dsub.forum = f.id
{$ratingsql->join}
WHERE (
p.userid = :postuserid OR
d.userid = :discussionuserid OR
dig.userid = :digestuserid OR
sub.userid = :subuserid OR
pref.userid = :prefuserid OR
hasread.userid = :hasreaduserid OR
dsub.userid = :dsubuserid OR
{$ratingsql->userwhere}
)
";

$params = [
'modname' => 'forum',
'contextlevel' => CONTEXT_MODULE,
'postuserid' => $userid,
'discussionuserid' => $userid,
'digestuserid' => $userid,
'subuserid' => $userid,
'prefuserid' => $userid,
'hasreaduserid' => $userid,
'dsubuserid' => $userid,
];
$params += $ratingsql->params;

$contextlist = new \core_privacy\local\request\contextlist();
$contextlist->add_from_sql($sql, $params);

return $contextlist;

}
</code>

====Exporting user data====

After determining where in Moodle your plugin holds data about a user, the ''\core_privacy\manager'' will then ask your plugin to export all user data for a subset of those locations.

This is achieved through use of the ''export_user_data'' function which takes the list of approved contexts in a ''\core_privacy\local\request\approved_contextlist'' object.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Export all user data for the specified user, in the specified contexts, using the supplied exporter instance.
*
* @param approved_contextlist $contextlist The approved contexts to export information for.
*/
public static function export_user_data(approved_contextlist $contextlist) {}
</code>

The ''approved_contextlist'' includes both the user record, and a list of contexts, which can be retrieved by either processing it as an Iterator, or by calling ''get_contextids()'' or ''get_contexts()'' as required.

Data is exported using a ''\core_privacy\local\request\content_writer'', which is described in further detail below.

===Plugins which store user preferences===

Many plugins store a variety of user preferences, and must therefore export them.

Since user preferences are site-wide, they are exported separately from other user data.
In some cases the only data present is user preference data, whilst in others there is a combination of user-provided data, and user preferences.

Exporting user preferences is achieved by implementing the ''\core_privacy\local\request\preference_provider'' interface, which defines one required function: ''export_user_preferences''.

====Example====

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Export all user preferences for the plugin.
*
* @param int $userid The userid of the user whose data is to be exported.
*/
public static function export_user_preferences(int $userid) {
$markasreadonnotification = get_user_preference('markasreadonnotification', null, $userid);
if (null !== $markasreadonnotification) {
switch ($markasreadonnotification) {
case 0:
$markasreadonnotificationdescription = get_string('markasreadonnotificationno', 'mod_forum');
break;
case 1:
default:
$markasreadonnotificationdescription = get_string('markasreadonnotificationyes', 'mod_forum');
break;
}
writer::export_user_preference('mod_forum', 'markasreadonnotification', $markasreadonnotification, $markasreadonnotificationdescription);
}
}
</code>

=== Plugins which can have their own subplugins ===

Many plugin types are also able to define their own subplugins and will need to define a contract between themselves and their subplugins in order to fetch their data.

This is required as the parent plugin and the child subplugin should be separate entities and the parent plugin must be able to function if one or more of its subplugins are uninstalled.

The parent plugin is responsible for defining the contract, and for interacting with its subplugins, though we intend to create helpers to make this easier.

The parent plugin should define a new interface for each type of subplugin that it defines. This interface should extend the ''\core_privacy\local\request\plugin\subplugin_provider'' interface.

==== When a parent plugin should and should not provide the interface for its subplugins ====

In some cases there is no point in a plugin providing a ''subplugin_provider''-based interface, even if it has its own subplugins. The Atto and TinyMCE editors are real examples.

If the parent plugin passes no data through to its subplugins, there is no benefit in defining a subplugin provider. For example, Atto subplugins only enhance the editor's functionality and never receive anything like a context. If such subplugins do store personal data that is not related to the parent plugin in any way, they should define their own standard provider.

Compare this with something like mod_assign, where the subplugins store data on behalf of the parent and that data only makes sense in the context of the parent plugin. In those cases the parent plugin should define a subplugin provider.

====Example====

The following example defines the contract that assign submission subplugins may be required to implement.

The assignment module is responsible for returning the contexts of all assignments where a user has data, but it cannot always know about all of that data. For example, if a teacher comments on a student submission, the information about this interaction may not be stored within the assignment module's own tables.

''mod/assign/classes/privacy/assignsubmission_provider.php''

<code php>
<?php
// …

namespace mod_assign\privacy;
use \core_privacy\local\metadata\collection;

interface assignsubmission_provider extends
// This Interface defines a subplugin.
\core_privacy\local\request\plugin\subplugin_provider {

/**
* Get the SQL required to find all submission items where this user has had any involvement.
*
* @param int $userid The user to search.
* @return \stdClass Object containing the join, params, and where used to select these records from the database.
*/
public static function get_items_with_user_interaction(int $userid) : \stdClass ;

/**
* Export all relevant user submissions information which match the combination of userid and attemptid.
*
* @param int $userid The user to search.
* @param \context $context The context to export this submission against.
* @param array $subcontext The subcontext within the context to export this information
* @param int $attid The id of the submission to export.
*/
public static function export_user_submissions(int $userid, \context $context, array $subcontext, int $attid) ;

}
</code>

===Plugins which are subplugins to another plugin===

If you are developing a sub-plugin of another plugin, then you will have to look at the relevant plugin in order to determine the exact contract.

Each subplugin type should define a new interface which extends the ''\core_privacy\local\request\plugin\subplugin_provider'' interface and it is up to the parent plugin to define how they will interact with their children.

The principles remain the same, but the exact implementation will differ depending upon requirements.

''mod/assign/submission/onlinetext/classes/privacy/provider.php''

<code php>
<?php
// …
namespace assignsubmission_onlinetext\privacy;

class provider implements
// This plugin does store personal user data.
\core_privacy\local\metadata\provider,

// This plugin is a subplugin of assign and must meet that contract.
\mod_assign\privacy\assignsubmission_provider {
}
</code>

===Plugins which are typically called by a Moodle subsystem===

There are a number of plugintypes in Moodle which are typically called by a specific Moodle subsystem.

Some of these are ''only'' called by that subsystem, for example plugins which are of the ''plagiarism'' plugintype should never be called directly, but are always invoked via the ''core_plagiarism'' subsystem.

Conversely, there may be other plugintypes which can be called both via a subsystem, and in some other fashion. We are still determining whether any plugintypes currently fit this pattern.

If you are developing a plugin which belongs to a specific subsystem, then you will have to look at the relevant subsystem in order to determine the exact contract.

Each subsystem will define a new interface which extends the ''\core_privacy\local\request\plugin\subsystem_provider'' interface and it is up to that subsystem to define how they will interact with those plugins.

The principles remain the same, but the exact implementation will differ depending upon requirements.

''plagiarism/detectorator/classes/privacy/provider.php''

<code php>
<?php
// …
namespace plagiarism_detectorator\privacy;

class provider implements
// This plugin does export personal user data.
\core_privacy\local\metadata\provider,

// This plugin is always linked against another activity module via the Plagiarism API.
\core_plagiarism\privacy\plagiarism_provider {
}
</code>

===Exporting data===

Any plugin which stores data must also export it.

To cater for this the privacy API includes a ''\core_privacy\local\request\content_writer'', which defines a set of functions to store different types of data.

Broadly speaking data is broken into the following types:

* Data - this is the object being described. For example, the post content in a forum post;
* Related data - this is data related to the object being stored. For example, ratings of a forum post;
* Metadata - this is metadata about the main object. For example, whether you are subscribed to a forum discussion;
* User preferences - this is data about a site-wide preference;
* Files - any files that are stored within Moodle on behalf of this plugin; and
* Custom files - for custom file formats, e.g. a calendar feed for calendar data. These should be used sparingly.

Each piece of data is stored against a specific Moodle ''context'', which will define how the data is structured within the exporter.
Data, and Related data only accept the ''stdClass'' object, whilst metadata should be stored as a set of key/value pairs which include a description.

In some cases the data being stored belongs within an implicit structure. For example, one forum has many forum discussions, which each have a number of forum posts. This structure is represented by an ''array'' referred to as a ''subcontext''.
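
As a minimal sketch of how such a subcontext array might be assembled, the helper below builds the path for a single forum post. The function name and the way each level is labelled are assumptions made for illustration; they are not part of the privacy API, and a real plugin is free to label its levels differently.

<code php>
<?php
/**
 * Build a hypothetical subcontext for one forum post.
 *
 * Each array element is one level of the implicit structure:
 * the discussion first, then the post within that discussion.
 *
 * @param string $discussionname The name of the discussion.
 * @param int $discussionid The ID of the discussion.
 * @param int $postid The ID of the post.
 * @return array The subcontext, ordered from outermost to innermost level.
 */
function example_post_subcontext(string $discussionname, int $discussionid, int $postid): array {
    return [
        // Level 1: the discussion within the forum.
        "{$discussionid}-{$discussionname}",
        // Level 2: the post within the discussion.
        "post-{$postid}",
    ];
}

$subcontext = example_post_subcontext('Welcome', 12, 345);
// $subcontext now holds two elements, one per level below the forum.
</code>

Keeping the outermost level first means that all posts of one discussion share the same leading element and so end up grouped together.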

The ''content_writer'' must ''always'' be called with a specific context, and can be called as follows:

''mod/forum/classes/privacy/provider.php''

<code php>
<?php
// …
use \core_privacy\local\request\writer;

writer::with_context($context)
->export_data($subcontext, $post)
->export_area_files($subcontext, 'mod_forum', 'post', $post->id)
->export_metadata($subcontext, 'postread', (object) ['firstread' => $firstread], new \lang_string('privacy:export:post:postread'));
</code>

Any text field which may contain embedded Moodle files must also have its pluginfile URLs rewritten before it is exported:

''mod/forum/classes/privacy/provider.php''

<code php>
<?php
// …
use \core_privacy\local\request\writer;

$post->message = writer::with_context($context)
->rewrite_pluginfile_urls($subcontext, 'mod_forum', 'post', $post->id, $post->message);

</code>

===Providing a way to delete user data===

Deleting user data is also implemented in the request interface. Two methods need to be created: the first removes all user data from a context, and the second removes the data of a specific user from a list of contexts.

====Delete for a context====

A context is given and all user data (for all users) is to be deleted from the plugin. This will be called when the retention period for the context has expired to adhere to the privacy by design requirement. Retention periods are set in the Data registry.

''mod/choice/classes/privacy/provider.php''

<code php>
public static function delete_data_for_all_users_in_context(deletion_criteria $criteria) {
global $DB;
$context = $criteria->get_context();
if (empty($context)) {
return;
}
$instanceid = $DB->get_field('course_modules', 'instance', ['id' => $context->instanceid], MUST_EXIST);
$DB->delete_records('choice_answers', ['choiceid' => $instanceid]);
}
</code>

====Delete personal information for a specific user and context(s)====

An ''approved_contextlist'' is given and user data related to that user should either be completely deleted, or overwritten if a structure needs to be maintained. This will be called when a user has requested the right to be forgotten. All attempts should be made to delete this data where practical while still allowing the plugin to be used by other users.

''mod/choice/classes/privacy/provider.php''

<code php>
public static function delete_data_for_user(approved_contextlist $contextlist) {
global $DB;

if (empty($contextlist->count())) {
return;
}
$userid = $contextlist->get_user()->id;
foreach ($contextlist->get_contexts() as $context) {
$instanceid = $DB->get_field('course_modules', 'instance', ['id' => $context->instanceid], MUST_EXIST);
$DB->delete_records('choice_answers', ['choiceid' => $instanceid, 'userid' => $userid]);
}
}
</code>
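
Where records cannot simply be removed because other users' data depends on them (for example, replies to a post in a thread), their content can be overwritten instead. The following is a minimal, self-contained sketch of that idea using plain objects. The function name and the field names (''userid'', ''subject'', ''message'') are assumptions made for illustration; a real plugin would apply the equivalent change to its own tables through the ''$DB'' API.

<code php>
<?php
/**
 * Blank the personal content of every record owned by $userid,
 * keeping the records themselves so that any structure built on them survives.
 *
 * @param stdClass[] $records Hypothetical records with userid, subject and message fields.
 * @param int $userid The user whose content should be removed.
 * @return stdClass[] The same records, with that user's content overwritten.
 */
function example_blank_user_content(array $records, int $userid): array {
    foreach ($records as $record) {
        if ((int) $record->userid === $userid) {
            // Overwrite rather than delete, so replies keep a parent to attach to.
            $record->subject = '';
            $record->message = '';
        }
    }
    return $records;
}

$records = [
    (object) ['id' => 1, 'userid' => 5, 'subject' => 'Hello', 'message' => 'First post'],
    (object) ['id' => 2, 'userid' => 7, 'subject' => 'Re: Hello', 'message' => 'A reply'],
];
$records = example_blank_user_content($records, 5);
</code>

After the call both records still exist, but only the one belonging to user 7 retains its content.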

===Difference between Moodle 3.3 and more recent versions===
Moodle 3.3 has a minimum requirement of PHP 5.6, which does not support scalar type hints or return type declarations. Consequently the privacy API for this version does not use these features.
====What to do if you have one plugin that supports multiple branches====
To support this we have put in place a polyfill. It works around the restriction that one version uses type hinting and return type declarations while another does not.

=====Example=====
To use the polyfill, include the legacy polyfill trait and define the required static methods with their names prefixed by an underscore (shown below).
<code php>
class provider implements
\core_privacy\local\metadata\provider,
\core_privacy\local\request\plugin\provider {

// This trait must be included.
use \core_privacy\local\legacy_polyfill;

// The required methods must be in this format starting with an underscore.
public static function _get_metadata(collection $collection) {
// Code for returning metadata goes here.
}
}
</code>
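
To illustrate the general mechanism, the following self-contained sketch shows how a trait can satisfy an interface's type-hinted method by forwarding to an underscore-prefixed method. All names here (''example_provider'', ''example_legacy_polyfill'', ''example_plugin_provider'', ''describe_user'') are hypothetical stand-ins, and this is not the source of Moodle's own polyfill. The sketch represents the PHP 7 side of the arrangement, where the interface and the trait carry the type declarations.

<code php>
<?php
// Hypothetical interface with PHP 7 scalar and return type declarations.
interface example_provider {
    public static function describe_user(int $userid) : string;
}

// Hypothetical polyfill: implements the typed method and forwards to an
// untyped, underscore-prefixed method supplied by the plugin.
trait example_legacy_polyfill {
    public static function describe_user(int $userid) : string {
        return static::_describe_user($userid);
    }
}

// The plugin's provider only declares the untyped method, so its own file
// contains no scalar type hints or return type declarations.
class example_plugin_provider implements example_provider {
    use example_legacy_polyfill;

    public static function _describe_user($userid) {
        return 'user:' . $userid;
    }
}

$result = example_plugin_provider::describe_user(42);
</code>

Because the typed signature lives only in the trait, the plugin's own file stays free of syntax that older PHP versions cannot parse.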

====What to do if you must implement a subplugin or subsystem plugin provider====
Subplugin types (e.g. assignsubmission, assignfeedback, quiz report, quiz access rules) and subsystems which have a plugintype relationship (portfolio, plagiarism, and others) also define their own legacy polyfill.

In this instance you will need to include the trait for both the core polyfill, and the provider polyfill as appropriate.
===== Example =====
<code php>
class provider implements
// This plugin has data and must therefore define the metadata provider in order to describe it.
\core_privacy\local\metadata\provider,

// This is a plagiarism plugin. It interacts with the plagiarism subsystem rather than with core.
\core_plagiarism\privacy\plagiarism_provider {

// This trait must be included to provide the relevant polyfill for the metadata provider.
use \core_privacy\local\legacy_polyfill;

// This trait must be included to provide the relevant polyfill for the plagiarism provider.
use \core_plagiarism\privacy\plagiarism_provider\legacy_polyfill;

// The required methods must be in this format starting with an underscore.
public static function _get_metadata(collection $collection) {
// Code for returning metadata goes here.
}

// This is one of the polyfilled methods from the plagiarism provider.
public static function _export_plagiarism_user_data($userid, \context $context, array $subcontext, array $linkarray) {
// ...
}
}
</code>

== Tips for development ==

* While implementing the privacy API into your plugin, there are CLI scripts that can help you to test things on the fly. Just don't forget these are not supposed to replace proper unit tests. See [[Privacy API/Utilities]] for details.

==See also==

* [[Subject Access Request FAQ]]
* [[:en:GDPR|GDPR]] in the user documentation

[[Category:Privacy]]
[[Category:GDPR]]
[[Category:API]]

GDPR for plugin developers

2018-03-26T10:37:37Z

Dmonllao: /* Delete personal information for a specific user and context(s) */

The General Data Protection Regulation (GDPR) is an EU regulation that aims to give users more control over their data and how it is processed. It will come into effect on 25th of May 2018 and covers any citizen or permanent resident of the European Union. The regulation will also be respected by a number of countries outside of the European Union.

To help institutions become compliant with this new regulation we are adding functionality to Moodle. This covers a number of components, including support for a user’s right to:

* request information on the types of personal data held, the instances of that data, and the deletion policy for each;
* access all of their data; and
* be forgotten.

The compliance requirements also extend to installed plugins (including third party plugins). These must also be able to report what information they store or process regarding users, and be able to provide and delete data in response to a user request.

This document describes the proposed API changes required for plugins which will allow a Moodle installation to become GDPR compliant.

Target Audience: The intended audience for this document is Moodle plugin developers, who are aiming to ensure their plugins are updated to comply with GDPR requirements coming into effect in the EU in May, 2018.

==Personal data in Moodle==

From the GDPR Spec, Article 4:

''‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;''

In Moodle, we need to consider two main types of personal data: information entered by the user, and information stored about the user. The key difference is that information stored about the user will have come from a source other than the user themselves. Both types of data can be used to form a profile of the individual.

The most obvious clue to finding personal data entered by the user is the presence of a userid on a database field. Any data on the record (or linked records) pertaining to that user may be deemed personal data for that user, including things like timestamps and record identification numbers. Additionally, any free text field which allows the user to enter information must also be considered to be the personal data of that user.

Data stored about the user includes things like ratings and comments made on a student submission. These may have been made by an assessor or teacher, but are considered the personal data of the student, as they are considered a reflection of the user’s competency in the subject matter and can be used to form a profile of that individual.

The sections that follow outline what you need to do as a plugin developer to ensure any personal data is advertised and can be accessed and deleted according to the GDPR requirements.

==Background==

===Architecture overview===

A new system for Privacy has been created within Moodle. This is broken down into several main parts and forms the ''core_privacy'' subsystem:

* Some metadata providers - a set of PHP interfaces to be implemented by components for that component to describe the kind of data that it stores, and the purpose for its storage;
* Some request providers - a set of PHP interfaces to be implemented by components to allow that component to act upon user requests such as the Right to be Forgotten, and a Subject Access Request; and
* A manager - a concrete class used to bridge components which implement the providers with tools which request their data.

All plugins will implement one metadata provider, and zero, one or two request providers.

The fetching of data is broken into two separate steps:

# Detecting in which Moodle contexts the user has any data; and
# Exporting all data from each of those contexts.

This has been broken into two steps to later allow administrators to exclude certain contexts from an export - e.g. for courses currently in progress.
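
The value of splitting the process in two can be sketched as follows: the contexts found in the first step are filtered before the second step runs. This is a plain PHP illustration with made-up context IDs and a hypothetical function name; it is not the manager's actual implementation.

<code php>
<?php
/**
 * Remove excluded contexts from the list found in the first step.
 *
 * @param int[] $foundcontextids Context IDs in which the user has data.
 * @param int[] $excludedcontextids Context IDs an administrator has excluded.
 * @return int[] The context IDs that remain approved for export.
 */
function example_approve_contexts(array $foundcontextids, array $excludedcontextids): array {
    // array_diff() keeps the keys of the first array, so re-index the result.
    return array_values(array_diff($foundcontextids, $excludedcontextids));
}

// Step 1 (hypothetical result): contexts in which the user has data.
$found = [10, 42, 57];
// Contexts to leave out, e.g. a course which is still in progress.
$excluded = [42];
// Step 2 would then export only from the approved contexts.
$approved = example_approve_contexts($found, $excluded);
</code>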

A third component will later be added to facilitate the deletion of data within these contexts which will help to satisfy the Right to be Forgotten. This will also use the first step.

===Implementing a provider===

All plugins will need to create a concrete class which implements the relevant metadata and request providers. The exact providers you need to implement will depend on what data you store, and the type of plugin. This is covered in more detail in the following sections of the document.

In order to do so:

# You must create a class called ''provider'' within the namespace ''\your_pluginname\privacy''.
# This class must be created at ''path/to/your/plugin/classes/privacy/provider.php''.
# You must have your class implement the relevant metadata and request interfaces.
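
The naming rules above can be sketched as a small helper that derives the expected class name and file location from a component name. The helper and its name are hypothetical and exist only to make the convention concrete; only the resulting class name and path follow the rules listed.

<code php>
<?php
/**
 * Return the expected privacy provider class and file for a plugin.
 *
 * @param string $component The plugin's component name, e.g. 'mod_forum'.
 * @param string $plugindir The plugin's directory relative to the Moodle root, e.g. 'mod/forum'.
 * @return array With keys 'class' and 'file'.
 */
function example_privacy_provider_location(string $component, string $plugindir): array {
    return [
        // The class is always called "provider" in the plugin's "privacy" namespace.
        'class' => '\\' . $component . '\\privacy\\provider',
        // The file always lives under classes/privacy/ in the plugin directory.
        'file' => rtrim($plugindir, '/') . '/classes/privacy/provider.php',
    ];
}

$location = example_privacy_provider_location('mod_forum', 'mod/forum');
</code>

For ''mod_forum'' this yields the class ''\mod_forum\privacy\provider'' in ''mod/forum/classes/privacy/provider.php'', which matches the file used in the examples throughout this document.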

==Plugins which do not store personal data==

Many Moodle plugins do not store any personal data. This is usually the case for plugins which just add functionality, or which display the data already stored elsewhere in Moodle.

Some examples of plugin types which might fit these criteria include themes, blocks, filters, editor plugins, etc.

Plugins which cause data to be stored elsewhere in Moodle (e.g. via a subsystem call) are considered to store data.

One example of a plugin which does not store any data would be the Calendar month block which just displays a view of the user’s calendar. It does not store any data itself.

An example of a plugin which must not use the null provider is the Comments block. The comments block is responsible for data subsequently being stored within Moodle. Although the block doesn’t store anything itself, it interacts with the comments subsystem and is the only component which knows how that data maps to a user.

===Implementation requirements===

In order to let Moodle know that you have audited your plugin, and that you do not store any personal user data, you must implement the ''\core_privacy\local\metadata\null_provider'' interface in your plugin’s provider.

The ''null_provider'' requires you to define one function ''get_reason()'' which returns the language string identifier within your component.

====Example====

''block/calendar_month/classes/privacy/provider.php''

<code php>
<?php
// …

namespace block_calendar_month\privacy;

class provider implements
// This plugin does not store any personal user data.
\core_privacy\local\metadata\null_provider
{

/**
* Get the language string identifier within the component's language
* file to explain why this plugin stores no data.
*
* @return string
*/
public static function get_reason() : string {
return 'privacy:null_reason';
}
}
</code>

''block/calendar_month/lang/en/block_calendar_month.php''

<code php>
<?php

$string['privacy:null_reason'] = 'The calendar month block displays information from the Calendar, but does not affect or store any data itself. All changes are made via the Calendar.';
</code>

That’s it. Congratulations, your plugin now implements the Privacy API.

==Plugins which store personal data==

Many Moodle plugins do store some form of personal data.

In some cases this will be stored within database tables in your plugin, and in other cases this will be in one of Moodle’s core subsystems - for example your plugin may store files, ratings, comments, or tags.

Plugins which do store data will need to:

* Describe the type of data that they store;
* Provide a way to export that data; and
* Provide a way to delete that data.

Data is described via a ''metadata'' provider, and it is both exported and deleted via an implementation of a ''request'' provider.

These are both explained in the sections below.

===Describing the type of data you store===

In order to describe the type of data that you store, you must implement the ''\core_privacy\local\metadata\provider'' interface.

This interface requires that you define one function: ''get_metadata''.

There are several types of item to describe the data that you store. These are for:

* Items in the Moodle database;
* Items stored by you in a Moodle subsystem - for example files, and ratings; and
* User preferences stored site-wide within Moodle for your plugin.

Note: All fields should include a description from a language string within your plugin.

====Example====

''mod/forum/classes/privacy/provider.php''

<code php>
<?php
// …

namespace mod_forum\privacy;
use \core_privacy\local\metadata\collection;

class provider implements
// This plugin does store personal user data.
\core_privacy\local\metadata\provider
{
public static function get_metadata(collection $collection) : collection {
return $collection;
}
}
</code>

====Indicating that you store content in a Moodle subsystem====

Many plugins will use one of the core Moodle subsystems to store data.

We do not expect plugin developers to describe those subsystems in detail, but we do need to know that you use them and what you use them for.

You can indicate this by calling the ''link_subsystem()'' method on the ''collection''.

=====Example=====

''mod/forum/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->link_subsystem(
'core_files',
'privacy:metadata:core_files'
);

return $collection;
}
</code>

''mod/forum/lang/en/forum.php''

<code php>
<?php

$string['privacy:metadata:core_files'] = 'The forum stores files which have been uploaded by the user to form part of a forum post.';
</code>

====Describing data stored in database tables====

Most Moodle plugins will store some form of user data in their own database tables.

As a plugin developer you will need to describe each database table, and each field which includes user data.

=====Example=====

''mod/forum/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_database_table(
'forum_discussion_subs',
[
'userid' => 'privacy:metadata:forum_discussion_subs:userid',
'discussionid' => 'privacy:metadata:forum_discussion_subs:discussionid',
'preference' => 'privacy:metadata:forum_discussion_subs:preference',

],
'privacy:metadata:forum_discussion_subs'
);

return $collection;
}
</code>

''mod/forum/lang/en/forum.php''

<code php>
<?php

$string['privacy:metadata:forum_discussion_subs'] = 'Information about the subscriptions to individual forum discussions. This includes when a user has chosen to subscribe to a discussion, or to unsubscribe from one where they would otherwise be subscribed.';
$string['privacy:metadata:forum_discussion_subs:userid'] = 'The ID of the user with this subscription preference.';
$string['privacy:metadata:forum_discussion_subs:discussionid'] = 'The ID of the discussion that was subscribed to.';
$string['privacy:metadata:forum_discussion_subs:preference'] = 'The start time of the subscription.';
</code>

====Indicating that you store site-wide user preferences====

Many plugins will include one or more user preferences. Unfortunately this is one of Moodle’s older components and many of the values stored are not pure user preferences. Each plugin should be aware of how it handles its own preferences and is best placed to determine whether they are site-wide preferences, or per-instance preferences.

Whilst most of these will have a fixed name (e.g. ''filepicker_recentrepository''), some will include a variable of some kind (e.g. ''tool_usertours_tour_completion_time_2''). Only the general name needs to be indicated rather than one copy for each preference.

Also, these should only be ''site-wide'' user preferences which do not belong to a specific Moodle context.

In the above examples:

* Preference ''filepicker_recentrepository'' belongs to the file subsystem, and is a site-wide preference affecting the user anywhere that they view the filepicker.
* Preference ''tool_usertours_tour_completion_time_2'' belongs to user tours. User tours are a site-wide feature which can affect many parts of Moodle and cross multiple contexts.

In some cases a value may be stored in the preferences table but is known to belong to a specific context within Moodle. In these cases they should be stored as metadata against that context rather than as a site-wide user preference.

You can indicate this by calling the ''add_user_preference()'' method on the ''collection''.

Any plugin providing user preferences must also implement the ''\core_privacy\local\request\preference_provider''.

=====Example=====

''admin/tool/usertours/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_user_preference('tool_usertours_tour_completion_time',
'privacy:metadata:preference:tool_usertours_tour_completion_time');

return $collection;
}
</code>

''admin/tool/usertours/lang/en/tool_usertours.php''

<code php>
<?php

$string['privacy:metadata:preference:tool_usertours_tour_completion_time'] = 'The time that a specific user tour was last completed by a user.';
</code>

====Indicating that you export data to an external location====

Many plugins will interact with external systems - for example cloud-based services. Often this external location is configurable within the plugin either at the site or the instance level.

As a plugin developer you will need to describe each ''type'' of target destination, alongside a list of each exported field which includes user data.
The ''actual'' destination does not need to be described as this can change based on configuration.

You can indicate this by calling the ''link_external_location()'' method on the collection.

=====Example=====

''mod/lti/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->link_external_location('lti_client', [
'userid' => 'privacy:metadata:lti_client:userid',
'fullname' => 'privacy:metadata:lti_client:fullname',
], 'privacy:metadata:lti_client');

return $collection;
}
</code>

''mod/lti/lang/en/lti.php''

<code php>
<?php

$string['privacy:metadata:lti_client'] = 'In order to integrate with a remote LTI service, user data needs to be exchanged with that service.';
$string['privacy:metadata:lti_client:userid'] = 'The userid is sent from Moodle to allow you to access your data on the remote system.';
$string['privacy:metadata:lti_client:fullname'] = 'Your full name is sent to the remote system to allow a better user experience.';
</code>

===Providing a way to export user data===

In order to export the user data that you store, you must implement the relevant request provider.

We have named these request providers because they are called in response to a specific request from a user to access their information.

There are several different types of request provider, and you may need to implement several of these, depending on the type and nature of your plugin.

Broadly speaking plugins will fit into one of the following categories:

* Plugins which are a subplugin of another plugin. Examples include ''assignsubmission'', ''atto'', and ''datafield'';
* Plugins which are typically called by a Moodle subsystem. Examples include ''qtype'', and ''profilefield'';
* All other plugins which store data.

Most plugins will fit into this final category, whilst other plugins may fall into several categories.
Plugins which ''define'' a subplugin will also be responsible for collecting this data from their subplugins.

A final category exists - plugins which store user preferences. In some cases this may be the ''only'' provider implemented.

====Standard plugins which store data====

A majority of Moodle plugins will fit into this category and will be required to implement the ''\core_privacy\local\request\plugin\provider'' interface. This interface requires that you define two functions:

* ''get_contexts_for_userid'' - to explain where data is held within Moodle for your plugin; and
* ''export_user_data'' - to export a user’s personal data from your plugin.

These APIs make use of the Moodle ''context'' system to hierarchically store this data.

====Retrieving the list of contexts====

Contexts are retrieved using the ''get_contexts_for_userid'' function which takes the ID of the user being fetched, and returns a list of contexts in which the user has any data.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {}
</code>

The function returns a ''\core_privacy\local\request\contextlist'' which is used to keep a set of contexts together in a fixed fashion.

Because a Subject Access Request covers ''every'' piece of data that is held for a user within Moodle, efficiency and performance are highly important. As a result, contexts are added to the ''contextlist'' by defining one or more SQL queries which return just the contextid. Multiple SQL queries can be added as required.

Many plugins will interact with specific subsystems and store data within them.
These subsystems will also provide a way in which to link the data that you have stored with your own database tables.
At present these are still a work in progress and only the ''core_rating'' subsystem includes this.

=====Basic example=====

The following example simply fetches the contextid for all forums where the user has started a discussion (note: this is an incomplete example):

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {
$contextlist = new \core_privacy\local\request\contextlist();

$sql = "SELECT c.id
FROM {context} c
INNER JOIN {course_modules} cm ON cm.id = c.instanceid AND c.contextlevel = :contextlevel
INNER JOIN {modules} m ON m.id = cm.module AND m.name = :modname
INNER JOIN {forum} f ON f.id = cm.instance
LEFT JOIN {forum_discussions} d ON d.forum = f.id
WHERE (
d.userid = :discussionuserid
)
";

$params = [
'modname' => 'forum',
'contextlevel' => CONTEXT_MODULE,
'discussionuserid' => $userid,
];

$contextlist->add_from_sql($sql, $params);

return $contextlist;
}
</code>

=====More complete example=====

The following example includes a link to core_rating.
It will find any forum, forum discussion, or forum post where the user has any data, including:

* Per-forum digest preferences;
* Per-forum subscription preferences;
* Per-forum read tracking preferences;
* Per-discussion subscription preferences;
* Per-post read data (if a user has read a post or not); and
* Per-post rating data.

In the case of the rating data, this will include any post where the user has rated the post of another user.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {
$ratingsql = \core_rating\privacy\provider::get_sql_join('rat', 'mod_forum', 'post', 'p.id', $userid);
// Fetch all forum discussions, and forum posts.
$sql = "SELECT c.id
FROM {context} c
INNER JOIN {course_modules} cm ON cm.id = c.instanceid AND c.contextlevel = :contextlevel
INNER JOIN {modules} m ON m.id = cm.module AND m.name = :modname
INNER JOIN {forum} f ON f.id = cm.instance
LEFT JOIN {forum_discussions} d ON d.forum = f.id
LEFT JOIN {forum_posts} p ON p.discussion = d.id
LEFT JOIN {forum_digests} dig ON dig.forum = f.id
LEFT JOIN {forum_subscriptions} sub ON sub.forum = f.id
LEFT JOIN {forum_track_prefs} pref ON pref.forumid = f.id
LEFT JOIN {forum_read} hasread ON hasread.forumid = f.id
LEFT JOIN {forum_discussion_subs} dsub ON dsub.forum = f.id
{$ratingsql->join}
WHERE (
p.userid = :postuserid OR
d.userid = :discussionuserid OR
dig.userid = :digestuserid OR
sub.userid = :subuserid OR
pref.userid = :prefuserid OR
hasread.userid = :hasreaduserid OR
dsub.userid = :dsubuserid OR
{$ratingsql->userwhere}
)
";

$params = [
'modname' => 'forum',
'contextlevel' => CONTEXT_MODULE,
'postuserid' => $userid,
'discussionuserid' => $userid,
'digestuserid' => $userid,
'subuserid' => $userid,
'prefuserid' => $userid,
'hasreaduserid' => $userid,
'dsubuserid' => $userid,
];
$params += $ratingsql->params;

$contextlist = new \core_privacy\local\request\contextlist();
$contextlist->add_from_sql($sql, $params);

return $contextlist;

}
</code>

====Exporting user data====

After determining where in Moodle your plugin holds data about a user, the ''\core_privacy\manager'' will then ask your plugin to export all user data for a subset of those locations.

This is achieved through use of the ''export_user_data'' function which takes the list of approved contexts in a ''\core_privacy\local\request\approved_contextlist'' object.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Export all user data for the specified user, in the specified contexts, using the supplied exporter instance.
*
* @param approved_contextlist $contextlist The approved contexts to export information for.
*/
public static function export_user_data(approved_contextlist $contextlist) {}
</code>

The ''approved_contextlist'' includes both the user record, and a list of contexts, which can be retrieved by either processing it as an Iterator, or by calling ''get_contextids()'' or ''get_contexts()'' as required.

Data is exported using a ''\core_privacy\local\request\content_writer'', which is described in further detail below.

===Plugins which store user preferences===

Many plugins store a variety of user preferences, and must therefore export them.

Since user preferences are a site-wide preference, these are exported separately to other user data.
In some cases the only data present is user preference data, whilst in others there is a combination of user-provided data, and user preferences.

Exporting of user preferences is achieved through implementation of the ''\core_privacy\local\request\preference_provider'' interface which defines one required function -- ''export_user_preferences''.

====Example====

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Export all user preferences for the plugin.
*
* @param int $userid The userid of the user whose data is to be exported.
*/
public static function export_user_preferences(int $userid) {
$markasreadonnotification = get_user_preference('markasreadonnotification', null, $userid);
if (null !== $markasreadonnotification) {
switch ($markasreadonnotification) {
case 0:
$markasreadonnotificationdescription = get_string('markasreadonnotificationno', 'mod_forum');
break;
case 1:
default:
$markasreadonnotificationdescription = get_string('markasreadonnotificationyes', 'mod_forum');
break;
}
writer::export_user_preference('mod_forum', 'markasreadonnotification', $markasreadonnotification, $markasreadonnotificationdescription);
}
}
</code>

===Plugins which define a subplugin===

Many plugin types are also able to define their own subplugins and will need to define a contract between themselves and their subplugins in order to fetch their data.

This is required as the parent plugin and the child subplugin should be separate entities and the parent plugin must be able to function if one or more of its subplugins are uninstalled.

The parent plugin is responsible for defining the contract, and for interacting with its subplugins, though we intend to create helpers to make this easier.

The parent plugin should define a new interface for each type of subplugin that it defines. This interface should extend the ''\core_privacy\local\request\plugin\subplugin_provider'' interface.

====Example====

The following example defines the contract that assign submission subplugins may be required to implement.

The assignment module is responsible for returning the contexts of all assignments in which a user has data, but it cannot always know about every such case. For example, if a teacher comments on a student submission, the assignment module may be unaware of the comment because the information about this interaction may not be stored within its own tables.

''mod/assign/classes/privacy/assignsubmission_provider.php''

<code php>
<?php
// …

namespace mod_assign\privacy;
use \core_privacy\local\metadata\collection;

interface assignsubmission_provider extends
# This Interface defines a subplugin.
\core_privacy\local\request\plugin\subplugin_provider
{

/**
* Get the SQL required to find all submission items where this user has had any involvements.
*
* @param int $userid The user to search.
* @return \stdClass Object containing the join, params, and where used to select these records from the database.
*/
public static function get_items_with_user_interaction(int $userid) : \stdClass ;

/**
* Export all relevant user submissions information which match the combination of userid and attemptid.
*
* @param int $userid The user to search.
* @param \context $context The context to export this submission against.
* @param array $subcontext The subcontext within the context to export this information
* @param int $attid The id of the submission to export.
*/
public static function export_user_submissions(int $userid, \context $context, array $subcontext, int $attid) ;

}
</code>

===Plugins which are subplugins to another plugin===

If you are developing a sub-plugin of another plugin, then you will have to look at the relevant plugin in order to determine the exact contract.

Each subplugin type should define a new interface which extends the ''\core_privacy\local\request\plugin\subplugin_provider'' interface and it is up to the parent plugin to define how they will interact with their children.

The principles remain the same, but the exact implementation will differ depending upon requirements.

''mod/assign/submission/onlinetext/classes/privacy/provider.php''

<code php>
<?php
// …
namespace assignsubmission_onlinetext\privacy;

class provider implements
# This plugin does store personal user data.
\core_privacy\local\metadata\provider,

# This plugin is a subplugin of assign and must meet that contract.
\mod_assign\privacy\assignsubmission_provider
{
}
</code>

===Plugins which are typically called by a Moodle subsystem===

There are a number of plugintypes in Moodle which are typically called by a specific Moodle subsystem.

Some of these are ''only'' called by that subsystem, for example plugins which are of the ''plagiarism'' plugintype should never be called directly, but are always invoked via the ''core_plagiarism'' subsystem.

Conversely, there may be other plugintypes which can be called both via a subsystem, and in some other fashion. We are still determining whether any plugintypes currently fit this pattern.

If you are developing a plugin which belongs to a specific subsystem, then you will have to look at the relevant subsystem in order to determine the exact contract.

Each subsystem will define a new interface which extends the ''\core_privacy\local\request\plugin\subsystem_provider'' interface and it is up to that subsystem to define how they will interact with those plugins.

The principles remain the same, but the exact implementation will differ depending upon requirements.

''plagiarism/detectorator/classes/privacy/provider.php''

<code php>
<?php
// …
namespace plagiarism_detectorator\privacy;

class provider implements
# This plugin does export personal user data.
\core_privacy\local\metadata\provider,

# This plugin is always linked against another activity module via the Plagiarism API.
\core_plagiarism\privacy\plugin_provider
{
}
</code>

===Exporting data===

Any plugin which stores data must also export it.

To cater for this the privacy API includes a ''\core_privacy\local\request\content_writer'', which defines a set of functions to store different types of data.

Broadly speaking data is broken into the following types:

* Data - this is the object being described. For example the post content in a forum post;
* Related data - this is data related to the object being stored. For example, ratings of a forum post;
* Metadata - This is metadata about the main object. For example whether you are subscribed to a forum discussion;
* User preferences - this is data about a site-wide preference;
* Files - Any files that are stored within Moodle on behalf of this plugin; and
* Custom files - For custom file formats - e.g. a calendar feed for calendar data. These should be used sparingly.

Each piece of data is stored against a specific Moodle ''context'', which will define how the data is structured within the exporter.
Data, and Related data only accept the ''stdClass'' object, whilst metadata should be stored as a set of key/value pairs which include a description.

In some cases the data being stored belongs within an implicit structure. For example, one forum has many forum discussions, which each have a number of forum posts. This structure is represented by an ''array'' referred to as a ''subcontext''.
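
As a rough illustration of that nesting, a subcontext for a single forum post could be assembled as an ordered array running from the outermost level to the innermost. The helper name ''example_post_subcontext()'' and the ''post-'' label format are hypothetical and are not part of the privacy API; this is only a sketch of the idea.

<code php>
<?php
/**
 * Build a subcontext array for a post inside a discussion (hypothetical helper).
 *
 * @param string $discussionname The discussion name, used as the outer level.
 * @param int $postid The post id, used as the inner level.
 * @return array The subcontext, ordered from outermost to innermost.
 */
function example_post_subcontext(string $discussionname, int $postid) : array {
    return [
        // Outer level: the discussion that the post belongs to.
        $discussionname,
        // Inner level: the individual post within that discussion.
        'post-' . $postid,
    ];
}

$subcontext = example_post_subcontext('Welcome thread', 42);
</code>

An array built this way can then be passed as the ''$subcontext'' argument when calling the writer.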

The ''content_writer'' must ''always'' be called with a specific context, and can be called as follows:

''mod/pluginname/classes/privacy/provider.php''

<code php>
<?php
// …
use \core_privacy\local\request\writer;

writer::with_context($context)
->export_data($subcontext, $post)
->export_area_files($subcontext, 'mod_forum', 'post', $post->id)
->export_metadata($subcontext, 'postread', (object) ['firstread' => $firstread], new \lang_string('privacy:export:post:postread'));
</code>

===Providing a way to delete user data===

Deleting user data is also implemented in the request interface. There are two methods that need to be created: the first removes all user data from a context, and the second removes user data for a specific user in a list of contexts.

====Delete for a context====

A context is given and all user data (for all users) is to be deleted from the plugin. This will be called when the retention period for the plugin has expired to adhere to the privacy by design requirement.

''mod/choice/classes/privacy/provider.php''

<code php>
public static function delete_data_for_all_users_in_context(deletion_criteria $criteria) {
global $DB;
$context = $criteria->get_context();
if (empty($context)) {
return;
}
$instanceid = $DB->get_field('course_modules', 'instance', ['id' => $context->instanceid], MUST_EXIST);
$DB->delete_records('choice_answers', ['choiceid' => $instanceid]);
}
</code>

====Delete personal information for a specific user and context(s)====

An ''approved_contextlist'' is given and user data related to that user should either be completely deleted, or overwritten if a structure needs to be maintained. This will be called when a user has requested the right to be forgotten. All attempts should be made to delete this data where practical while still allowing the plugin to be used by other users.

''mod/choice/classes/privacy/provider.php''

<code php>
public static function delete_data_for_user(approved_contextlist $contextlist) {
global $DB;

if (empty($contextlist->count())) {
return;
}
$userid = $contextlist->get_user()->id;
foreach ($contextlist->get_contexts() as $context) {
$instanceid = $DB->get_field('course_modules', 'instance', ['id' => $context->instanceid], MUST_EXIST);
$DB->delete_records('choice_answers', ['choiceid' => $instanceid, 'userid' => $userid]);
}
}
</code>
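
The example above deletes the records outright. Where a structure needs to be maintained (for instance, a post that other users have replied to), the personal fields could be overwritten instead. The following is a minimal sketch of that idea in plain PHP; the helper name ''example_overwrite_fields()'' and the record fields are hypothetical, and a real plugin would still need to write the cleaned record back to the database.

<code php>
<?php
/**
 * Blank the personal fields of a record while keeping the record itself (hypothetical helper).
 *
 * @param stdClass $record The record to clean; the original object is left untouched.
 * @param string[] $fields The names of the fields holding user-entered content.
 * @return stdClass A copy of the record with those fields emptied.
 */
function example_overwrite_fields(stdClass $record, array $fields) : stdClass {
    $clean = clone $record;
    foreach ($fields as $field) {
        // Only touch fields that exist, so that structural fields are never added or removed.
        if (property_exists($clean, $field)) {
            $clean->$field = '';
        }
    }
    return $clean;
}

$post = (object) ['id' => 5, 'parent' => 3, 'subject' => 'Hello', 'message' => 'My name is Sam'];
$cleaned = example_overwrite_fields($post, ['subject', 'message']);
</code>

The identifiers that hold the structure together (''id'' and ''parent'' here) are kept, so replies from other users remain attached to the emptied record.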

==See also==

* [[Subject Access Request FAQ]]
* [[Privacy API]]
* [[:en:GDPR|GDPR]] in the user documentation

[[Category:GDPR]]

GDPR for plugin developers

2018-03-26T09:41:11Z

Dmonllao: /* Delete for a context */

The General Data Protection Regulation (GDPR) is an EU regulation that looks at providing users with more control over their data and how it is processed. This regulation will come into effect on 25th of May 2018 and covers any citizen or permanent resident of the European Union. The regulation will be respected by a number of other countries outside of the European Union.

To help institutions become compliant with this new regulation we are adding functionality to Moodle. This covers a number of components which, amongst other things, support a user’s right to:

* request information on the types of personal data held, the instances of that data, and the deletion policy for each;
* access all of their data; and
* be forgotten.

The compliance requirements also extend to installed plugins (including third party plugins). These also need to be able to report what information they store or process regarding users, and have the ability to provide and delete data for a user request.

This document describes the proposed API changes required for plugins which will allow a Moodle installation to become GDPR compliant.

Target Audience: The intended audience for this document is Moodle plugin developers, who are aiming to ensure their plugins are updated to comply with GDPR requirements coming into effect in the EU in May, 2018.

==Personal data in Moodle==

From the GDPR Spec, Article 4:

''‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;''

In Moodle, we need to consider two main types of personal data: information entered by the user and information stored about the user. The key difference is that information stored about the user will have come from a source other than the user themselves. Both types of data can be used to form a profile of the individual.

The most obvious clue to finding personal data entered by the user is the presence of a userid on a database field. Any data on the record (or linked records) pertaining to that user may be deemed personal data for that user, including things like timestamps and record identification numbers. Additionally, any free text field which allows the user to enter information must also be considered to be the personal data of that user.

Data stored about the user includes things like ratings and comments made on a student submission. These may have been made by an assessor or teacher, but are considered the personal data of the student, as they are considered a reflection of the user’s competency in the subject matter and can be used to form a profile of that individual.

The sections that follow outline what you need to do as a plugin developer to ensure any personal data is advertised and can be accessed and deleted according to the GDPR requirements.

==Background==

===Architecture overview===

A new system for Privacy has been created within Moodle. This is broken down into several main parts and forms the ''core_privacy'' subsystem:

* Some metadata providers - a set of PHP interfaces to be implemented by components for that component to describe the kind of data that it stores, and the purpose for its storage;
* Some request providers - a set of PHP interfaces to be implemented by components to allow that component to act upon user requests such as the Right to be Forgotten, and a Subject Access Request; and
* A manager - a concrete class used to bridge components which implement the providers with tools which request their data.

All plugins will implement one metadata provider, and zero, one or two request providers.

The fetching of data is broken into two separate steps:

# Detecting in which Moodle contexts the user has any data; and
# Exporting all data from each of those contexts.

This has been broken into two steps to later allow administrators to exclude certain contexts from an export - e.g. for courses currently in progress.
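
The benefit of the split can be sketched in plain PHP: the context ids found in the first step are filtered before the second step runs. The helper name ''example_approve_contexts()'' and the ids below are hypothetical, and the real manager works with ''contextlist'' objects rather than bare arrays.

<code php>
<?php
/**
 * Remove excluded context ids from the list found in step one (hypothetical helper).
 *
 * @param int[] $foundcontextids Context ids in which the user has data.
 * @param int[] $excludedcontextids Context ids that an administrator has excluded.
 * @return int[] The context ids that step two should export.
 */
function example_approve_contexts(array $foundcontextids, array $excludedcontextids) : array {
    // array_diff() keeps the original keys, so reindex the result.
    return array_values(array_diff($foundcontextids, $excludedcontextids));
}

$approved = example_approve_contexts([10, 20, 30], [20]);
</code>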

A third component will later be added to facilitate the deletion of data within these contexts which will help to satisfy the Right to be Forgotten. This will also use the first step.

===Implementing a provider===

All plugins will need to create a concrete class which implements the relevant metadata and request providers. The exact providers you need to implement will depend on what data you store, and the type of plugin. This is covered in more detail in the following sections of the document.

In order to do so:

# You must create a class called ''provider'' within the namespace ''\your_pluginname\privacy''.
# This class must be created at ''path/to/your/plugin/classes/privacy/provider.php''.
# You must have your class implement the relevant metadata and request interfaces.

==Plugins which do not store personal data==

Many Moodle plugins do not store any personal data. This is usually the case for plugins which just add functionality, or which display the data already stored elsewhere in Moodle.

Some examples of plugin types which might fit these criteria include themes, blocks, filters, editor plugins, etc.

Plugins which cause data to be stored elsewhere in Moodle (e.g. via a subsystem call) are considered to store data.

One example of a plugin which does not store any data would be the Calendar month block which just displays a view of the user’s calendar. It does not store any data itself.

An example of a plugin which must not use the null provider is the Comments block. The comments block is responsible for data subsequently being stored within Moodle. Although the block doesn’t store anything itself, it interacts with the comments subsystem and is the only component which knows how that data maps to a user.

===Implementation requirements===

In order to let Moodle know that you have audited your plugin, and that you do not store any personal user data, you must implement the ''\core_privacy\local\metadata\null_provider'' interface in your plugin’s provider.

The ''null_provider'' requires you to define one function ''get_reason()'' which returns the language string identifier within your component.

====Example====

''blocks/calendar_month/classes/privacy/provider.php''

<code php>
<?php
// …

namespace block_calendar_month\privacy;

class provider implements
# This plugin does not store any personal user data.
\core_privacy\local\metadata\null_provider
{

/**
* Get the language string identifier with the component's language
* file to explain why this plugin stores no data.
*
* @return string
*/
public static function get_reason() : string {
return 'privacy:null_reason';
}
}
</code>

''blocks/calendar_month/lang/en/block_calendar_month.php''

<code php>
<?php

$string['privacy:null_reason'] = 'The calendar month block displays information from the Calendar, but does not affect or store any data itself. All changes are made via the Calendar.';
</code>

That’s it. Congratulations, your plugin now implements the Privacy API.

==Plugins which store personal data==

Many Moodle plugins do store some form of personal data.

In some cases this will be stored within database tables in your plugin, and in other cases this will be in one of Moodle’s core subsystems - for example your plugin may store files, ratings, comments, or tags.

Plugins which do store data will need to:

* Describe the type of data that they store;
* Provide a way to export that data; and
* Provide a way to delete that data.

Data is described via a ''metadata'' provider, and it is both exported and deleted via an implementation of a ''request'' provider.

These are both explained in the sections below.

===Describing the type of data you store===

In order to describe the type of data that you store, you must implement the ''\core_privacy\local\metadata\provider'' interface.

This interface requires that you define one function: ''get_metadata''.

There are several types of item to describe the data that you store. These are for:

* Items in the Moodle database;
* Items stored by you in a Moodle subsystem - for example files, and ratings; and
* User preferences stored site-wide within Moodle for your plugin.

Note: All fields should include a description from a language string within your plugin.

====Example====

''mod/forum/classes/privacy/provider.php''

<code php>
<?php
// …

namespace mod_forum\privacy;
use \core_privacy\local\metadata\collection;

class provider implements
# This plugin does store personal user data.
\core_privacy\local\metadata\provider
{
public static function get_metadata(collection $collection) : collection {
return $collection;
}
}
</code>

====Indicating that you store content in a Moodle subsystem====

Many plugins will use one of the core Moodle subsystems to store data.

As a plugin developer we do not expect you to describe those subsystems in detail, but we do need to know that you use them and to know what you use them for.

You can indicate this by calling the ''link_subsystem()'' method on the ''collection''.

=====Example=====

''mod/forum/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->link_subsystem(
'core_files',
'privacy:metadata:core_files'
);

return $collection;
}
</code>

''mod/forum/lang/en/forum.php''

<code php>
<?php

$string['privacy:metadata:core_files'] = 'The forum stores files which have been uploaded by the user to form part of a forum post.';
</code>

====Describing data stored in database tables====

Most Moodle plugins will store some form of user data in their own database tables.

As a plugin developer you will need to describe each database table, and each field which includes user data.

=====Example=====

''mod/forum/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_database_table(
'forum_discussion_subs',
[
'userid' => 'privacy:metadata:forum_discussion_subs:userid',
'discussionid' => 'privacy:metadata:forum_discussion_subs:discussionid',
'preference' => 'privacy:metadata:forum_discussion_subs:preference',

],
'privacy:metadata:forum_discussion_subs'
);

return $collection;
}
</code>

''mod/forum/lang/en/forum.php''

<code php>
<?php

$string['privacy:metadata:forum_discussion_subs'] = 'Information about the subscriptions to individual forum discussions. This includes when a user has chosen to subscribe to a discussion, or to unsubscribe from one where they would otherwise be subscribed.';
$string['privacy:metadata:forum_discussion_subs:userid'] = 'The ID of the user with this subscription preference.';
$string['privacy:metadata:forum_discussion_subs:discussionid'] = 'The ID of the discussion that was subscribed to.';
$string['privacy:metadata:forum_discussion_subs:preference'] = 'The start time of the subscription.';
</code>

====Indicating that you store site-wide user preferences====

Many plugins will include one or more user preferences. Unfortunately this is one of Moodle’s older components and many of the values stored are not pure user preferences. Each plugin should be aware of how it handles its own preferences and is best placed to determine whether they are site-wide preferences, or per-instance preferences.

Whilst most of these will have a fixed name (e.g. ''filepicker_recentrepository''), some will include a variable of some kind (e.g. ''tool_usertours_tour_completion_time_2''). Only the general name needs to be indicated rather than one copy for each preference.

Also, these should only be ''site-wide'' user preferences which do not belong to a specific Moodle context.

In the above examples:

* Preference ''filepicker_recentrepository'' belongs to the file subsystem, and is a site-wide preference affecting the user anywhere that they view the filepicker.
* Preference ''tool_usertours_tour_completion_time_2'' belongs to user tours. User tours are a site-wide feature which can affect many parts of Moodle and cross multiple contexts.

In some cases a value may be stored in the preferences table but is known to belong to a specific context within Moodle. In these cases they should be stored as metadata against that context rather than as a site-wide user preference.

You can indicate this by calling the ''add_user_preference()'' method on the ''collection''.

Any plugin providing user preferences must also implement the ''\core_privacy\local\request\preference_provider''.
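
When a preference name carries a variable suffix, the export code has to find every stored preference that shares the general name. The following is a minimal sketch of that matching in plain PHP; the helper name ''example_filter_preferences()'' and the sample values are hypothetical, and a real plugin would read the preferences from Moodle rather than from a literal array.

<code php>
<?php
/**
 * Pick out the preferences that share a general name (hypothetical helper).
 *
 * @param array $preferences All of a user's preferences as name => value pairs.
 * @param string $generalname The general name, without the variable suffix.
 * @return array The matching values, keyed by their numeric suffix.
 */
function example_filter_preferences(array $preferences, string $generalname) : array {
    $matches = [];
    // Match the general name followed by an underscore and a numeric suffix.
    $pattern = '/^' . preg_quote($generalname, '/') . '_(\d+)$/';
    foreach ($preferences as $name => $value) {
        if (preg_match($pattern, $name, $found)) {
            $matches[(int) $found[1]] = $value;
        }
    }
    return $matches;
}

$preferences = [
    'tool_usertours_tour_completion_time_2' => '1521970000',
    'tool_usertours_tour_completion_time_7' => '1521980000',
    'filepicker_recentrepository' => '4',
];
$tourtimes = example_filter_preferences($preferences, 'tool_usertours_tour_completion_time');
</code>

Only the general name is described in the metadata, while each matching preference is still exported individually.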

=====Example=====

''admin/tool/usertours/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->add_user_preference('tool_usertours_tour_completion_time',
'privacy:metadata:preference:tool_usertours_tour_completion_time');

return $collection;
}
</code>

''admin/tool/usertours/lang/en/tool_usertours.php''

<code php>
<?php

$string['privacy:metadata:preference:tool_usertours_tour_completion_time'] = 'The time that a specific user tour was last completed by a user.';
</code>

====Indicating that you export data to an external location====

Many plugins will interact with external systems - for example cloud-based services. Often this external location is configurable within the plugin either at the site or the instance level.

As a plugin developer you will need to describe each ''type'' of target destination, alongside a list of each exported field which includes user data.
The ''actual'' destination does not need to be described as this can change based on configuration.

You can indicate this by calling the ''link_external_location()'' method on the collection.

=====Example=====

''mod/lti/classes/privacy/provider.php''

<code php>
public static function get_metadata(collection $collection) : collection {

$collection->link_external_location('lti_client', [
'userid' => 'privacy:metadata:lti_client:userid',
'fullname' => 'privacy:metadata:lti_client:fullname',
], 'privacy:metadata:lti_client');

return $collection;
}
</code>

''mod/lti/lang/en/lti.php''

<code php>
<?php

$string['privacy:metadata:lti_client'] = 'In order to integrate with a remote LTI service, user data needs to be exchanged with that service.';
$string['privacy:metadata:lti_client:userid'] = 'The userid is sent from Moodle to allow you to access your data on the remote system.';
$string['privacy:metadata:lti_client:fullname'] = 'Your full name is sent to the remote system to allow a better user experience.';
</code>

===Providing a way to export user data===

In order to export the user data that you store, you must implement the relevant request provider.

We have named these request providers because they are called in response to a specific request from a user to access their information.

There are several different types of request provider, and you may need to implement several of these, depending on the type and nature of your plugin.

Broadly speaking plugins will fit into one of the following categories:

* Plugins which are a subplugin of another plugin. Examples include ''assignsubmission'', ''atto'', and ''datafield'';
* Plugins which are typically called by a Moodle subsystem. Examples include ''qtype'', and ''profilefield'';
* All other plugins which store data.

Most plugins will fit into this final category, whilst other plugins may fall into several categories.
Plugins which ''define'' a subplugin will also be responsible for collecting this data from their subplugins.

A final category exists - plugins which store user preferences. In some cases this may be the ''only'' provider implemented.

====Standard plugins which store data====

A majority of Moodle plugins will fit into this category and will be required to implement the ''\core_privacy\local\request\plugin\provider'' interface. This interface requires that you define two functions:

* ''get_contexts_for_userid'' - to explain where data is held within Moodle for your plugin; and
* ''export_user_data'' - to export a user’s personal data from your plugin.

These APIs make use of the Moodle ''context'' system to hierarchically store this data.

====Retrieving the list of contexts====

Contexts are retrieved using the ''get_contexts_for_userid'' function which takes the ID of the user being fetched, and returns a list of contexts in which the user has any data.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {}
</code>

The function returns a ''\core_privacy\local\request\contextlist'' which is used to keep a set of contexts together in a fixed fashion.

Because a Subject Access Request covers ''every'' piece of data that is held for a user within Moodle, efficiency and performance are highly important. As a result, contexts are added to the ''contextlist'' by defining one or more SQL queries which return just the contextid. Multiple SQL queries can be added as required.

Many plugins will interact with specific subsystems and store data within them.
These subsystems will also provide a way in which to link the data that you have stored with your own database tables.
At present these are still a work in progress and only the ''core_rating'' subsystem includes this.

=====Basic example=====

The following example simply fetches the contextid for all forums where a user has started at least one discussion (note: this is an incomplete example):

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {
$contextlist = new \core_privacy\local\request\contextlist();

$sql = "SELECT c.id
FROM {context} c
INNER JOIN {course_modules} cm ON cm.id = c.instanceid AND c.contextlevel = :contextlevel
INNER JOIN {modules} m ON m.id = cm.module AND m.name = :modname
INNER JOIN {forum} f ON f.id = cm.instance
LEFT JOIN {forum_discussions} d ON d.forum = f.id
WHERE (
d.userid = :discussionuserid
)
";

$params = [
'modname' => 'forum',
'contextlevel' => CONTEXT_MODULE,
'discussionuserid' => $userid,
];

$contextlist->add_from_sql($sql, $params);

return $contextlist;
}
</code>

=====More complete example=====

The following example includes a link to core_rating.
It will find any forum, forum discussion, or forum post where the user has any data, including:

* Per-forum digest preferences;
* Per-forum subscription preferences;
* Per-forum read tracking preferences;
* Per-discussion subscription preferences;
* Per-post read data (if a user has read a post or not); and
* Per-post rating data.

In the case of the rating data, this will include any post where the user has rated the post of another user.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Get the list of contexts that contain user information for the specified user.
*
* @param int $userid The user to search.
* @return contextlist $contextlist The list of contexts used in this plugin.
*/
public static function get_contexts_for_userid(int $userid) : contextlist {
$ratingsql = \core_rating\privacy\provider::get_sql_join('rat', 'mod_forum', 'post', 'p.id', $userid);
// Fetch all forum discussions, and forum posts.
$sql = "SELECT c.id
FROM {context} c
INNER JOIN {course_modules} cm ON cm.id = c.instanceid AND c.contextlevel = :contextlevel
INNER JOIN {modules} m ON m.id = cm.module AND m.name = :modname
INNER JOIN {forum} f ON f.id = cm.instance
LEFT JOIN {forum_discussions} d ON d.forum = f.id
LEFT JOIN {forum_posts} p ON p.discussion = d.id
LEFT JOIN {forum_digests} dig ON dig.forum = f.id
LEFT JOIN {forum_subscriptions} sub ON sub.forum = f.id
LEFT JOIN {forum_track_prefs} pref ON pref.forumid = f.id
LEFT JOIN {forum_read} hasread ON hasread.forumid = f.id
LEFT JOIN {forum_discussion_subs} dsub ON dsub.forum = f.id
{$ratingsql->join}
WHERE (
p.userid = :postuserid OR
d.userid = :discussionuserid OR
dig.userid = :digestuserid OR
sub.userid = :subuserid OR
pref.userid = :prefuserid OR
hasread.userid = :hasreaduserid OR
dsub.userid = :dsubuserid OR
{$ratingsql->userwhere}
)
";

$params = [
'modname' => 'forum',
'contextlevel' => CONTEXT_MODULE,
'postuserid' => $userid,
'discussionuserid' => $userid,
'digestuserid' => $userid,
'subuserid' => $userid,
'prefuserid' => $userid,
'hasreaduserid' => $userid,
'dsubuserid' => $userid,
];
$params += $ratingsql->params;

$contextlist = new \core_privacy\local\request\contextlist();
$contextlist->add_from_sql($sql, $params);

return $contextlist;

}
</code>

====Exporting user data====

After determining where in Moodle your plugin holds data about a user, the ''\core_privacy\manager'' will then ask your plugin to export all user data for a subset of those locations.

This is achieved through use of the ''export_user_data'' function which takes the list of approved contexts in a ''\core_privacy\local\request\approved_contextlist'' object.

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Export all user data for the specified user, in the specified contexts, using the supplied exporter instance.
*
* @param approved_contextlist $contextlist The approved contexts to export information for.
*/
public static function export_user_data(approved_contextlist $contextlist) {}
</code>

The ''approved_contextlist'' includes both the user record, and a list of contexts, which can be retrieved by either processing it as an Iterator, or by calling ''get_contextids()'' or ''get_contexts()'' as required.

Data is exported using a ''\core_privacy\local\request\content_writer'', which is described in further detail below.

===Plugins which store user preferences===

Many plugins store a variety of user preferences, and must therefore export them.

Since user preferences are site-wide, they are exported separately from other user data.
In some cases the only data present is user preference data, whilst in others there is a combination of user-provided data, and user preferences.

Exporting of user preferences is achieved through implementation of the ''\core_privacy\local\request\preference_provider'' interface, which defines one required function: ''export_user_preferences''.

====Example====

''mod/forum/classes/privacy/provider.php''

<code php>
/**
* Export all user preferences for the plugin.
*
* @param int $userid The userid of the user whose data is to be exported.
*/
public static function export_user_preferences(int $userid) {
$markasreadonnotification = get_user_preference('markasreadonnotification', null, $userid);
if (null !== $markasreadonnotification) {
switch ($markasreadonnotification) {
case 0:
$markasreadonnotificationdescription = get_string('markasreadonnotificationno', 'mod_forum');
break;
case 1:
default:
$markasreadonnotificationdescription = get_string('markasreadonnotificationyes', 'mod_forum');
break;
}
writer::export_user_preference('mod_forum', 'markasreadonnotification', $markasreadonnotification, $markasreadonnotificationdescription);
}
}
</code>

===Plugins which define a subplugin===

Many plugin types are also able to define their own subplugins and will need to define a contract between themselves and their subplugins in order to fetch their data.

This is required as the parent plugin and the child subplugin should be separate entities and the parent plugin must be able to function if one or more of its subplugins are uninstalled.

The parent plugin is responsible for defining the contract, and for interacting with its subplugins, though we intend to create helpers to make this easier.

The parent plugin should define a new interface for each type of subplugin that it defines. This interface should extend the ''\core_privacy\local\request\plugin\subplugin_provider'' interface.

====Example====

The following example defines the contract that assign submission subplugins may be required to implement.

The assignment module is responsible for returning the contexts of all assignments where a user has data, but in some cases it is unaware of all of them. For example, if a teacher comments on a student submission, the assignment module may not be aware of this because the information about the interaction may not be stored within its own tables.

''mod/assign/classes/privacy/assignsubmission_provider.php''

<code php>
<?php
// …

namespace mod_assign\privacy;
use \core_privacy\local\metadata\collection;

interface assignsubmission_provider extends
# This Interface defines a subplugin.
\core_privacy\local\request\plugin\subplugin_provider
{

/**
* Get the SQL required to find all submission items where this user has had any involvements.
*
* @param int $userid The user to search.
* @return \stdClass Object containing the join, params, and where used to select these records from the database.
*/
public static function get_items_with_user_interaction(int $userid) : \stdClass ;

/**
* Export all relevant user submissions information which match the combination of userid and attemptid.
*
* @param int $userid The user to search.
* @param \context $context The context to export this submission against.
* @param array $subcontext The subcontext within the context to export this information
* @param int $attid The id of the submission to export.
*/
public static function export_user_submissions(int $userid, \context $context, array $subcontext, int $attid) ;

}
</code>

===Plugins which are subplugins to another plugin===

If you are developing a sub-plugin of another plugin, then you will have to look at the relevant plugin in order to determine the exact contract.

Each subplugin type should define a new interface which extends the ''\core_privacy\local\request\plugin\subplugin_provider'' interface and it is up to the parent plugin to define how they will interact with their children.

The principles remain the same, but the exact implementation will differ depending upon requirements.

''mod/assign/submission/onlinetext/classes/privacy/provider.php''

<code php>
<?php
// …
namespace assignsubmission_onlinetext\privacy;

class provider implements
# This plugin does store personal user data.
\core_privacy\local\metadata\provider,

# This plugin is a subplugin of assign and must meet that contract.
\mod_assign\privacy\assignsubmission_provider
{
}
</code>

===Plugins which are typically called by a Moodle subsystem===

There are a number of plugintypes in Moodle which are typically called by a specific Moodle subsystem.

Some of these are ''only'' called by that subsystem, for example plugins which are of the ''plagiarism'' plugintype should never be called directly, but are always invoked via the ''core_plagiarism'' subsystem.

Conversely, there may be other plugintypes which can be called both via a subsystem, and in some other fashion. We are still determining whether any plugintypes currently fit this pattern.

If you are developing a plugin which belongs to a specific subsystem, then you will have to look at the relevant subsystem in order to determine the exact contract.

Each subsystem will define a new interface which extends the ''\core_privacy\local\request\plugin\subsystem_provider'' interface and it is up to that subsystem to define how they will interact with those plugins.

The principles remain the same, but the exact implementation will differ depending upon requirements.

''plagiarism/detectorator/classes/privacy/provider.php''

<code php>
<?php
// …
namespace plagiarism_detectorator\privacy;

class provider implements
# This plugin does export personal user data.
\core_privacy\local\metadata\provider,

# This plugin is always linked against another activity module via the Plagiarism API.
\core_plagiarism\privacy\plugin_provider
{
}
</code>

===Exporting data===

Any plugin which stores data must also export it.

To cater for this the privacy API includes a ''\core_privacy\local\request\content_writer'', which defines a set of functions to store different types of data.

Broadly speaking data is broken into the following types:

* Data - this is the object being described. For example the post content in a forum post;
* Related data - this is data related to the object being stored. For example, ratings of a forum post;
* Metadata - This is metadata about the main object. For example whether you are subscribed to a forum discussion;
* User preferences - this is data about a site-wide preference;
* Files - Any files that are stored within Moodle on behalf of this plugin; and
* Custom files - For custom file formats - e.g. a calendar feed for calendar data. These should be used sparingly.

Each piece of data is stored against a specific Moodle ''context'', which will define how the data is structured within the exporter.
Data, and Related data only accept the ''stdClass'' object, whilst metadata should be stored as a set of key/value pairs which include a description.

In some cases the data being stored belongs within an implicit structure. For example, one forum has many forum discussions, each of which has a number of forum posts. This structure is represented by an ''array'' referred to as a ''subcontext''.
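As an illustration of the subcontext idea, the following minimal sketch builds the path for a single post inside a discussion. The function name and the ''Discussions'' and ''Posts'' labels are hypothetical and chosen only for this example; each plugin decides on its own labels.

<code php>
<?php
/**
 * Hypothetical helper: build the subcontext path for one forum post.
 *
 * Each array element is one level of the exported structure, from the
 * outermost level (discussions) down to the individual post.
 *
 * @param string $discussionname The name of the discussion containing the post.
 * @param int $postid The id of the post.
 * @return string[] The subcontext path.
 */
function example_post_subcontext(string $discussionname, int $postid): array {
    return [
        'Discussions',      // Assumed label for the discussion level.
        $discussionname,
        'Posts',            // Assumed label for the post level.
        (string) $postid,
    ];
}
</code>

An array built this way could then be passed as the ''$subcontext'' argument when exporting data for that post.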

The ''content_writer'' must ''always'' be called with a specific context, and can be called as follows:

''mod/pluginname/classes/privacy/provider.php''

<code php>
<?php
// …
use \core_privacy\local\request\writer;

writer::with_context($context)
->export_data($subcontext, $post)
->export_area_files($subcontext, 'mod_forum', 'post', $post->id)
->export_metadata($subcontext, 'postread', (object) ['firstread' => $firstread], new \lang_string('privacy:export:post:postread'));
</code>

===Providing a way to delete user data===

Deleting user data is also implemented in the request interface. Two methods need to be created: the first removes all user data from a context, and the second removes the data of a specific user in a list of contexts.

====Delete for a context====

A context is given and all user data (for all users) is to be deleted from the plugin. This will be called when the retention period for the plugin has expired, in order to adhere to the privacy by design requirement.

''mod/choice/classes/privacy/provider.php''

<code php>
public static function delete_data_for_all_users_in_context(deletion_criteria $criteria) {
global $DB;
$context = $criteria->get_context();
if (empty($context)) {
return;
}
$instanceid = $DB->get_field('course_modules', 'instance', ['id' => $context->instanceid], MUST_EXIST);
$DB->delete_records('choice_answers', ['choiceid' => $instanceid]);
}
</code>

====Delete personal information for a specific user and context(s)====

An ''approved_contextlist'' is given and user data related to that user should either be completely deleted, or overwritten if a structure needs to be maintained. This will be called when a user has requested the right to be forgotten. All attempts should be made to delete this data where practical while still allowing the plugin to be used by other users.

''mod/choice/classes/privacy/provider.php''

<code php>
public static function delete_userdata_for_contexts(approved_contextlist $contextlist) {
global $DB;

if (empty($contextlist->count())) {
return;
}
$userid = $contextlist->get_user()->id;
foreach ($contextlist->get_contexts() as $context) {
$instanceid = $DB->get_field('course_modules', 'instance', ['id' => $context->instanceid], MUST_EXIST);
$DB->delete_records('choice_answers', ['choiceid' => $instanceid, 'userid' => $userid]);
}
}
</code>

==See also==

* [[Subject Access Request FAQ]]
* [[Privacy API]]
* [[:en:GDPR|GDPR]] in the user documentation

[[Category:GDPR]]

Machine learning backends

2017-12-27T08:06:32Z

Dmonllao: /* Regressor */

== Introduction ==

Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learning training, prediction and model evaluation. It may also be helpful to read [https://docs.moodle.org/dev/Analytics_API Analytics API] for concept definitions, how these concepts are implemented in Moodle and how machine learning backend plugins fit into the analytics API.

Communication between machine learning backends and Moodle happens through files, because the code that processes the dataset can be written in PHP, in Python or in other languages, or can even use cloud services. This needs to be scalable, so backends are expected to manage large files and, if necessary, to train algorithms by reading input files in batches.
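Reading an input file in batches can be sketched as follows. This is a minimal illustration, not code from any core backend: the function name is hypothetical, and it assumes a plain comma-separated file with one sample per line, which may differ from the real dataset layout.

<code php>
<?php
/**
 * Hypothetical helper: yield the rows of a CSV dataset in fixed-size batches,
 * so that only one batch is held in memory at a time.
 *
 * @param string $path Path to the dataset file (assumed to be plain CSV).
 * @param int $batchsize Maximum number of rows per batch.
 * @return \Generator Yields arrays of rows; each row is an array of strings.
 */
function example_read_batches(string $path, int $batchsize): \Generator {
    if ($batchsize < 1) {
        throw new \InvalidArgumentException('Batch size must be at least 1.');
    }
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open dataset: $path");
    }
    try {
        $batch = [];
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // Skip blank lines.
            }
            $batch[] = $row;
            if (count($batch) === $batchsize) {
                yield $batch;
                $batch = [];
            }
        }
        if ($batch !== []) {
            yield $batch; // Last, possibly smaller, batch.
        }
    } finally {
        fclose($handle);
    }
}
</code>

A training loop could iterate over the generator with ''foreach'' and update the algorithm once per batch, instead of loading the whole file at once.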

== Backends included in Moodle core ==

The '''PHP backend''' is the default predictions processor as it is written in PHP and does not have any external dependencies. It uses logistic regression.
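For readers unfamiliar with logistic regression, the prediction step can be sketched as below. This is a generic illustration of the technique and is not the PHP backend's actual code; the function name, the weights and the bias are hypothetical.

<code php>
<?php
/**
 * Hypothetical helper: logistic regression prediction for one sample.
 *
 * Computes the weighted sum of the features plus a bias and maps it to a
 * probability between 0 and 1 with the logistic (sigmoid) function.
 *
 * @param float[] $weights One learned weight per feature (list).
 * @param float $bias The learned intercept.
 * @param float[] $features The feature values of the sample (list).
 * @return float Probability that the sample belongs to the positive class.
 */
function example_logistic_probability(array $weights, float $bias, array $features): float {
    if (count($weights) !== count($features)) {
        throw new \InvalidArgumentException('Weights and features must have the same length.');
    }
    $z = $bias;
    foreach (array_values($weights) as $i => $weight) {
        $z += $weight * array_values($features)[$i];
    }
    return 1.0 / (1.0 + exp(-$z));
}
</code>

A binary classifier built on this would typically predict the positive class when the returned probability is at least 0.5.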

The '''Python backend''' requires the ''python'' binary (either python 2 or python 3) and the [https://pypi.python.org/pypi?name=moodlemlbackend&version=0.0.5&:action=display moodlemlbackend python package], which is maintained by Moodle HQ. It is based on [https://www.tensorflow.org/ Google's tensorflow library] and uses a feed-forward neural network with a single hidden layer. The ''moodlemlbackend'' package stores model performance information that can be visualised using [https://www.tensorflow.org/get_started/summaries_and_tensorboard tensorboard]. Information generated during model evaluation is available through the models management page, under each model's ''Actions > Log'' menu. The ''moodlemlbackend'' source code is available at https://github.com/moodlehq/moodle-mlbackend-python.

'''The Python backend is recommended over the PHP backend''' as it predicts more accurately and is faster.

== Interfaces ==

A summary of the purpose of these interfaces:
* Evaluate a provided prediction model
* Train machine learning algorithms with the existing site data
* Predict targets based on previously trained algorithms

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. There are two main types of predictor: ''classifiers'' and ''regressors''. We provide the ''Regressor'' interface but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

You can use '''is_ready''' to check that the backend is available.

/**
* Is it ready to predict?
*
* @return bool
*/
public function is_ready();

The purpose of '''clear_model''' and '''delete_output_dir''' is to clean up data created by the machine learning backend.

/**
* Delete all stored information of the current model id.
*
* This method is called when there are important changes to a model,
* all previous training algorithms using that version of the model
* should be deleted.
*
* @param string $uniqueid The site model unique id string
* @param string $modelversionoutputdir The output dir of this model version
* @return null
*/
public function clear_model($uniqueid, $modelversionoutputdir);

/**
* Delete the output directory.
*
* This method is called when a model is completely deleted.
*
* @param string $modeloutputdir The model directory id (parent of all model versions subdirectories).
* @return null
*/
public function delete_output_dir($modeloutputdir);
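
A backend that keeps its state on disk could clean up by removing the directory recursively. The following is a minimal sketch under that assumption; the function name is hypothetical and it uses only standard PHP classes rather than any Moodle helper.

<code php>
<?php
/**
 * Hypothetical helper: recursively delete a model output directory.
 *
 * Children are visited before their parents so that every directory is
 * already empty when it is removed.
 *
 * @param string $dir The directory to delete.
 * @return void
 */
function example_delete_output_dir(string $dir): void {
    if (!is_dir($dir)) {
        return; // Nothing to clean up.
    }
    $items = new \RecursiveIteratorIterator(
        new \RecursiveDirectoryIterator($dir, \FilesystemIterator::SKIP_DOTS),
        \RecursiveIteratorIterator::CHILD_FIRST
    );
    foreach ($items as $item) {
        if ($item->isDir() && !$item->isLink()) {
            rmdir($item->getPathname());
        } else {
            unlink($item->getPathname());
        }
    }
    rmdir($dir);
}
</code>

The same helper could serve both methods: ''clear_model'' would pass the directory of a single model version, while ''delete_output_dir'' would pass the parent directory of all versions.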

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

/**
* Train this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_classification($uniqueid, \stored_file $dataset, $outputdir);

/**
* Classifies the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function classify($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_classification($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);
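
The interface does not spell out how ''$maxdeviation'' and ''$niterations'' are used. One plausible reading, assumed here, is that the evaluation is repeated ''$niterations'' times and the result is only trusted when the scores do not vary by more than ''$maxdeviation''. The sketch below summarises such repeated runs; the function name and the property names of the returned object are hypothetical.

<code php>
<?php
/**
 * Hypothetical helper: summarise the scores of repeated evaluation runs.
 *
 * @param float[] $scores One score (between 0 and 1) per evaluation iteration.
 * @param float $maxdeviation Largest standard deviation still considered stable.
 * @return \stdClass Object with 'score', 'deviation' and 'stable' properties (assumed names).
 */
function example_summarise_evaluation(array $scores, float $maxdeviation): \stdClass {
    if ($scores === []) {
        throw new \InvalidArgumentException('At least one score is required.');
    }
    $n = count($scores);
    $mean = array_sum($scores) / $n;

    // Population standard deviation of the scores.
    $squares = 0.0;
    foreach ($scores as $score) {
        $squares += ($score - $mean) ** 2;
    }

    $result = new \stdClass();
    $result->score = $mean;
    $result->deviation = sqrt($squares / $n);
    $result->stable = $result->deviation <= $maxdeviation;
    return $result;
}
</code>

With scores that differ widely between iterations, ''stable'' is false, which would suggest that the dataset is too small or too noisy for a reliable evaluation.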

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is continuous (linear), such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome value is known (e.g. past courses in which each student's final grade is known). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

/**
* Train this processor regression model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_regression($uniqueid, \stored_file $dataset, $outputdir);

/**
* Estimates linear values for the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param mixed $outputdir
* @return void
*/
public function estimate($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor regression model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_regression($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);

Machine learning backends

2017-12-27T08:06:20Z

Dmonllao: /* Classifier */

== Introduction ==

Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learning training, prediction and model evaluation. It may also be helpful to read [https://docs.moodle.org/dev/Analytics_API Analytics API] for concept definitions, how these concepts are implemented in Moodle and how machine learning backend plugins fit into the analytics API.

Communication between machine learning backends and Moodle happens through files, because the code that processes the dataset can be written in PHP, in Python or in other languages, or can even use cloud services. This needs to be scalable, so backends are expected to manage large files and, if necessary, to train algorithms by reading input files in batches.

== Backends included in Moodle core ==

The '''PHP backend''' is the default predictions processor as it is written in PHP and does not have any external dependencies. It uses logistic regression.

The '''Python backend''' requires the ''python'' binary (either python 2 or python 3) and the [https://pypi.python.org/pypi?name=moodlemlbackend&version=0.0.5&:action=display moodlemlbackend python package], which is maintained by Moodle HQ. It is based on [https://www.tensorflow.org/ Google's tensorflow library] and uses a feed-forward neural network with a single hidden layer. The ''moodlemlbackend'' package stores model performance information that can be visualised using [https://www.tensorflow.org/get_started/summaries_and_tensorboard tensorboard]. Information generated during model evaluation is available through the models management page, under each model's ''Actions > Log'' menu. The ''moodlemlbackend'' source code is available at https://github.com/moodlehq/moodle-mlbackend-python.

'''The Python backend is recommended over the PHP backend''' as it predicts more accurately and is faster.

== Interfaces ==

A summary of the purpose of these interfaces:
* Evaluate a provided prediction model
* Train machine learning algorithms with the existing site data
* Predict targets based on previously trained algorithms

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. There are two main types of predictor: ''classifiers'' and ''regressors''. We provide the ''Regressor'' interface but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

You can use '''is_ready''' to check that the backend is available.

/**
* Is it ready to predict?
*
* @return bool
*/
public function is_ready();

The purpose of '''clear_model''' and '''delete_output_dir''' is to clean up data created by the machine learning backend.

/**
* Delete all stored information of the current model id.
*
* This method is called when there are important changes to a model,
* all previous training algorithms using that version of the model
* should be deleted.
*
* @param string $uniqueid The site model unique id string
* @param string $modelversionoutputdir The output dir of this model version
* @return null
*/
public function clear_model($uniqueid, $modelversionoutputdir);

/**
* Delete the output directory.
*
* This method is called when a model is completely deleted.
*
* @param string $modeloutputdir The model directory id (parent of all model versions subdirectories).
* @return null
*/
public function delete_output_dir($modeloutputdir);

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

Both these methods and ''Predictor'' methods should be implemented.

/**
* Train this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_classification($uniqueid, \stored_file $dataset, $outputdir);

/**
* Classifies the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function classify($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_classification($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is continuous (linear), such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome value is known (e.g. past courses in which each student's final grade is known). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

/**
* Train this processor regression model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_regression($uniqueid, \stored_file $dataset, $outputdir);

/**
* Estimates linear values for the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param mixed $outputdir
* @return void
*/
public function estimate($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor regression model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_regression($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);

Machine learning backends

2017-12-27T08:05:36Z

Dmonllao: /* Regressor */

== Introduction ==

Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learning training, prediction and model evaluation. It may also be helpful to read [https://docs.moodle.org/dev/Analytics_API Analytics API] for concept definitions, how these concepts are implemented in Moodle and how machine learning backend plugins fit into the analytics API.

Communication between machine learning backends and Moodle happens through files, because the code that processes the dataset can be written in PHP, in Python or in other languages, or can even use cloud services. This needs to be scalable, so backends are expected to manage large files and, if necessary, to train algorithms by reading input files in batches.

== Backends included in Moodle core ==

The '''PHP backend''' is the default predictions processor as it is written in PHP and does not have any external dependencies. It uses logistic regression.

The '''Python backend''' requires the ''python'' binary (either Python 2 or Python 3) and the [https://pypi.python.org/pypi?name=moodlemlbackend&version=0.0.5&:action=display moodlemlbackend Python package], which is maintained by Moodle HQ. It is based on [https://www.tensorflow.org/ Google's TensorFlow library] and uses a feed-forward neural network with a single hidden layer. The ''moodlemlbackend'' package stores model performance information that can be visualised using [https://www.tensorflow.org/get_started/summaries_and_tensorboard TensorBoard]. Information generated during model evaluation is available through the models management page, under each model's ''Actions > Log'' menu. The ''moodlemlbackend'' source code is available at https://github.com/moodlehq/moodle-mlbackend-python.

'''The Python backend is recommended over the PHP backend''' as it predicts more accurately and is faster.

== Interfaces ==

A summary of the purpose of these interfaces:
* Evaluate a provided prediction model
* Train machine learning algorithms with the existing site data
* Predict targets based on previously trained algorithms

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. There are two main types of predictor: ''classifiers'' and ''regressors''. We provide the ''Regressor'' interface, but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

You can use '''is_ready''' to check that the backend is available.

/**
* Is it ready to predict?
*
* @return bool
*/
public function is_ready();

The purpose of '''clear_model''' and '''delete_output_dir''' is to clean up the data created by the machine learning backend.

/**
* Delete all stored information of the current model id.
*
* This method is called when there are important changes to a model,
* all previous training algorithms using that version of the model
* should be deleted.
*
* @param string $uniqueid The site model unique id string
* @param string $modelversionoutputdir The output dir of this model version
* @return null
*/
public function clear_model($uniqueid, $modelversionoutputdir);

/**
* Delete the output directory.
*
* This method is called when a model is completely deleted.
*
* @param string $modeloutputdir The model directory id (parent of all model versions subdirectories).
* @return null
*/
public function delete_output_dir($modeloutputdir);

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

/**
* Train this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_classification($uniqueid, \stored_file $dataset, $outputdir);

/**
* Classifies the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function classify($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_classification($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is continuous (a "linear" value in the interface's terms), such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

/**
* Train this processor regression model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_regression($uniqueid, \stored_file $dataset, $outputdir);

/**
* Estimates linear values for the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param mixed $outputdir
* @return void
*/
public function estimate($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor regression model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_regression($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);

Machine learning backends

2017-12-27T08:05:14Z

Dmonllao: /* Classifier */

== Introduction ==

Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learning training, prediction and model evaluation. It is worth also reading [https://docs.moodle.org/dev/Analytics_API Analytics API], which defines the concepts used here, explains how they are implemented in Moodle and shows how machine learning backend plugins fit into the Analytics API.

Machine learning backends and Moodle communicate through files because the code that processes the dataset can be written in PHP, in Python or in other languages, or can even use cloud services. Backends need to be scalable, so they are expected to manage big files and, if necessary, to train algorithms by reading input files in batches.

== Backends included in Moodle core ==

The '''PHP backend''' is the default predictions processor as it is written in PHP and does not have any external dependencies. It uses logistic regression.

The '''Python backend''' requires the ''python'' binary (either Python 2 or Python 3) and the [https://pypi.python.org/pypi?name=moodlemlbackend&version=0.0.5&:action=display moodlemlbackend Python package], which is maintained by Moodle HQ. It is based on [https://www.tensorflow.org/ Google's TensorFlow library] and uses a feed-forward neural network with a single hidden layer. The ''moodlemlbackend'' package stores model performance information that can be visualised using [https://www.tensorflow.org/get_started/summaries_and_tensorboard TensorBoard]. Information generated during model evaluation is available through the models management page, under each model's ''Actions > Log'' menu. The ''moodlemlbackend'' source code is available at https://github.com/moodlehq/moodle-mlbackend-python.

'''The Python backend is recommended over the PHP backend''' as it predicts more accurately and is faster.

== Interfaces ==

A summary of the purpose of these interfaces:
* Evaluate a provided prediction model
* Train machine learning algorithms with the existing site data
* Predict targets based on previously trained algorithms

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. There are two main types of predictor: ''classifiers'' and ''regressors''. We provide the ''Regressor'' interface, but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

You can use '''is_ready''' to check that the backend is available.

/**
* Is it ready to predict?
*
* @return bool
*/
public function is_ready();

The purpose of '''clear_model''' and '''delete_output_dir''' is to clean up the data created by the machine learning backend.

/**
* Delete all stored information of the current model id.
*
* This method is called when there are important changes to a model,
* all previous training algorithms using that version of the model
* should be deleted.
*
* @param string $uniqueid The site model unique id string
* @param string $modelversionoutputdir The output dir of this model version
* @return null
*/
public function clear_model($uniqueid, $modelversionoutputdir);

/**
* Delete the output directory.
*
* This method is called when a model is completely deleted.
*
* @param string $modeloutputdir The model directory id (parent of all model versions subdirectories).
* @return null
*/
public function delete_output_dir($modeloutputdir);

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

/**
* Train this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function train_classification($uniqueid, \stored_file $dataset, $outputdir);

/**
* Classifies the provided dataset samples.
*
* @param string $uniqueid
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function classify($uniqueid, \stored_file $dataset, $outputdir);

/**
* Evaluates this processor classification model using the provided supervised learning dataset.
*
* @param string $uniqueid
* @param float $maxdeviation
* @param int $niterations
* @param \stored_file $dataset
* @param string $outputdir
* @return \stdClass
*/
public function evaluate_classification($uniqueid, $maxdeviation, $niterations, \stored_file $dataset, $outputdir);

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is continuous (a "linear" value in the interface's terms), such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

Machine learning backends

2017-12-27T07:43:00Z

Dmonllao: /* Predictor */

Machine learning backends

2017-12-27T07:42:46Z

Dmonllao: /* Predictor */

Analytics API

2017-12-27T07:14:23Z

Dmonllao: /* Time-splitting method (core_analytics\local\time_splitting\base) */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that third parties can extend to create their own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even though the models included in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

The '''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with student submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and have never been accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size, it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.
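The "analyse per course, then merge" idea can be sketched as follows. This is only an illustration in Python, not the code Moodle uses: the function name ''merge_datasets'' and the assumption that each per-course dataset is a list of rows starting with a shared header row are both hypothetical.

```python
def merge_datasets(parts):
    """Merge per-course datasets that share a header row into one dataset."""
    merged = []
    for rows in parts:
        if not rows:
            continue  # a course that produced no samples
        header, body = rows[0], rows[1:]
        if not merged:
            merged.append(header)  # keep the header only once
        elif merged[0] != header:
            raise ValueError("datasets have different columns")
        merged.extend(body)
    return merged


# Hypothetical per-course results: one header row plus calculated rows.
course_a = [["indicator1", "target"], ["0.5", "1"]]
course_b = []
course_c = [["indicator1", "target"], ["-1", "0"], ["1", "1"]]
site_dataset = merge_datasets([course_a, course_b, course_c])
```

Because each course is analysed independently, a long site-wide run can be interrupted and resumed between courses without losing the work already done.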

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start arriving, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with the '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated in a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
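The two scaling approaches described above can be sketched as follows. This is an illustration in Python rather than Moodle's PHP code: both function names are hypothetical, the first assumes simple linear scaling between 0 and a chosen maximum (10 posts here), and the second assumes that 2 standard deviations map to the ends of the range.

```python
def scale_by_thresholds(count, maximum=10):
    """Map a raw event count onto [-1, 1]: 0 events -> -1, maximum or more -> 1."""
    clipped = max(0, min(count, maximum))
    return 2 * clipped / maximum - 1


def scale_by_deviation(count, mean, stdev, spread=2.0):
    """Map a count onto [-1, 1] by its distance from the mean in standard deviations."""
    if stdev == 0:
        return 0.0  # everyone has the same value, so nobody deviates
    z = (count - mean) / stdev
    # spread standard deviations above or below the mean reach the limits.
    return max(-1.0, min(1.0, z / spread))


no_posts = scale_by_thresholds(0)        # lowest possible value
many_posts = scale_by_thresholds(25)     # capped at the maximum
average_student = scale_by_deviation(5, mean=5, stdev=2)
```

Either way the indicator never leaves the -1 to 1 range, whatever the raw count is.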

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
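As an illustration of dividing a course with fixed start and end dates into equal segments, here is a minimal sketch. It is written in Python and does not reproduce any core time-splitting class: the function name ''split_ranges'' and its parameters are hypothetical, and timestamps are plain integers.

```python
def split_ranges(start, end, parts, accumulative=False):
    """Return (range_start, range_end) pairs that divide [start, end] into equal parts."""
    if parts < 1 or end <= start:
        raise ValueError("need parts >= 1 and end > start")
    length = (end - start) / parts
    ranges = []
    for i in range(parts):
        # Accumulative ranges always begin at the start of the course.
        range_start = start if accumulative else start + int(i * length)
        # The last range ends exactly at the course end to avoid rounding gaps.
        range_end = end if i == parts - 1 else start + int((i + 1) * length)
        ranges.append((range_start, range_end))
    return ranges


quarters = split_ranges(0, 100, 4)
accumulated = split_ranges(0, 100, 4, accumulative=True)
```

With ''accumulative=True'' each range includes all activity since the start of the course, which corresponds to the accumulative option mentioned above.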

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, with core ('''lib/classes/analytics''') hosting the general-purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system; the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include machine learning backend interfaces; they are available at https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones are going to be processed next (skipping the ones processed recently, for example). You can take core_analytics\course as an example.
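The lazy-loading pattern can be sketched as follows. This is a Python illustration of the idea only and does not show how core_analytics\course is written: the class name ''LazyCourse'', the ''loader'' callable and the ''fullname'' and ''startdate'' keys are assumptions made for the example.

```python
class LazyCourse:
    """Analysable-like object that keeps only an id until more data is needed."""

    def __init__(self, course_id, loader):
        self._id = course_id
        self._loader = loader  # callable: id -> dict of course data
        self._data = None      # nothing loaded yet

    def get_id(self):
        return self._id        # cheap: never triggers a load

    def _load(self):
        if self._data is None:
            self._data = self._loader(self._id)  # load once, on first use
        return self._data

    def get_name(self):
        return self._load()["fullname"]

    def get_start(self):
        return self._load().get("startdate", False)


calls = []

def fake_loader(course_id):
    """Stand-in for a database query; records which ids were loaded."""
    calls.append(course_id)
    return {"fullname": "Course %d" % course_id, "startdate": 1500000000}

courses = [LazyCourse(i, fake_loader) for i in range(1, 1001)]
ids = [course.get_id() for course in courses]  # no data loaded so far
first_name = courses[0].get_name()             # loads course 1 only
first_start = courses[0].get_start()           # reuses the loaded data
```

Listing a thousand analysables costs a thousand small objects, and the full record is fetched only for the ones that are actually analysed.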

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int.
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''e.g. if your model is only interested in quizzes with a close time, the analyser will return all quizzes and your model will exclude the ones without a close time. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return the sample ids together with the data associated with them. This is important for 2 reasons:
* The data they provide, alongside the sample origin, is used to filter out indicators that are not related to what this analyser analyses. ''e.g. course analysers provide courses and information about courses, but not information about users, whereas an '''is user profile complete''' indicator requires the user object to be available. A model using a course analyser will therefore not be able to use the '''is user profile complete''' indicator.''
* The data included here is cached in PHP static vars. On one hand this reduces the number of db queries indicators need to perform; on the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
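
The two-element return value documented above can be illustrated with a small standalone sketch. The function name and the plain arrays are hypothetical: a real analyser would read these rows from the database, and keying the related data by table name is an assumption made here so that it lines up with how indicators name the data they require.

```php
<?php
// Hypothetical sketch of the return shape described in the docblocks above.
// $enrolments stands in for rows a real analyser would read from the database.
function example_get_all_samples(array $enrolments, array $course): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $enrolment) {
        $id = (int)$enrolment['id'];
        // array[0]: the sample ids (keyed by id in this sketch).
        $sampleids[$id] = $id;
        // array[1]: related data for each sample, keyed by a table name
        // (assumption) so indicators can look up what they need.
        $samplesdata[$id] = [
            'user_enrolments' => $enrolment,
            'course' => $course,
        ];
    }
    return [$sampleids, $samplesdata];
}

$course = ['id' => 7, 'fullname' => 'Algebra'];
$enrolments = [
    ['id' => 101, 'userid' => 3],
    ['id' => 102, 'userid' => 4],
];
list($sampleids, $samplesdata) = example_get_all_samples($enrolments, $course);
```

Note that every sample in this sketch carries a copy of the course data; this is the kind of trade-off between fewer queries and higher memory usage mentioned above.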

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the Moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context with a sampleid. This is important because predictions for this sample will only be available to users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in the ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators.

You can use '''required_sample_data''' to specify what your indicator needs in order to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. e.g. if you need a grade_grades record mark it as required, but there is no need to require the ''user'' object and the ''course'' as well, because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them anyway, because the principle analysers follow is to include as much related data as possible. Still, do not flag related objects as required: an analyser may, for example, choose not to include the ''user'' object because it is too big and sites could have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
return null;
}
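
The compatibility rule described above (an indicator is only usable when the analyser provides everything it requires) can be sketched as a simple set check. The function name is hypothetical and this is not the code Moodle uses internally.

```php
<?php
// Hypothetical sketch of the compatibility check described above;
// this is not the check Moodle core performs internally.
function example_indicator_is_usable(?array $required, array $provided): bool {
    if (empty($required)) {
        // Nothing required (the default): usable with any analyser.
        return true;
    }
    // Usable only if every required element is among the provided ones.
    return count(array_diff($required, $provided)) === 0;
}
```

For example, an indicator requiring ''user'' is rejected when the analyser only provides ''course'' and ''context'', which is why requiring more than strictly necessary reduces how widely the indicator can be reused.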

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during the $starttime - $endtime period), but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?). ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);
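
As an illustration of the value range, the following standalone sketch scales a count of actions inside the time window into the -1 to 1 range. The function, its parameters and the ''EXAMPLE_'' constants are hypothetical; a real linear indicator would implement ''calculate_sample()'' in a class and obtain its data from the analyser or the database.

```php
<?php
// Hypothetical helper: not part of the Moodle API.
const EXAMPLE_MIN_VALUE = -1;
const EXAMPLE_MAX_VALUE = 1;

/**
 * Scale the number of actions inside [$starttime, $endtime] into the
 * [-1, 1] range, or return null when the value cannot be calculated.
 *
 * @param int[]|null $actiontimes Timestamps of the sample's actions, null if unknown
 * @param int $starttime
 * @param int $endtime
 * @param int $expected Number of actions that maps to the maximum value (assumption)
 * @return float|null
 */
function example_calculate_sample(?array $actiontimes, int $starttime, int $endtime, int $expected): ?float {
    if ($actiontimes === null || $expected <= 0) {
        return null;
    }
    $count = 0;
    foreach ($actiontimes as $time) {
        if ($time >= $starttime && $time <= $endtime) {
            $count++;
        }
    }
    // Cap the ratio at 1 so the result never exceeds the maximum value.
    $ratio = min((float)$count / $expected, 1.0);
    return EXAMPLE_MIN_VALUE + $ratio * (EXAMPLE_MAX_VALUE - EXAMPLE_MIN_VALUE);
}
```

Returning null rather than the minimum value keeps "could not be calculated" distinct from "no activity".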

Note that performance here is critical, as this method runs once for each sample and for each range in the time-splitting method. Some tips:
* To avoid performance issues and repeated db queries, analyser classes provide information about the samples that you can use in your calculations. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also override the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is limited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keep memory usage acceptably low and prevents hard-to-trace caching bugs.

==== Target (core_analytics\local\target\base) ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Even though Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification. So unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically targets can be reused between models, although this is not recommended; focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The only valid use case I can think of for models in production is using different time-splitting methods with the same target although, again, the proper way to solve this is to use a single time-splitting method specific to your needs.

The first thing a target must define is the analyser class that it will use. The analyser class is specified in '''get_analyser_class'''.

/**
* Returns the analyser class that should be used along with this target.
*
* @return string The full class name as a string
*/
abstract public function get_analyser_class();

'''is_valid_analysable''' and '''is_valid_sample''' are used to discard elements that are not valid for your target.

/**
* Allows the target to verify that the analysable is a good candidate.
*
* This method can be used as a quick way to discard invalid analysables.
* e.g. Imagine that your analysable doesn't have students and you need them.
*
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return true|string
*/
public function is_valid_analysable(\core_analytics\analysable $analysable, $fortraining = true);

/**
* Is this sample from the $analysable valid?
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return bool
*/
public function is_valid_sample($sampleid, \core_analytics\analysable $analysable, $fortraining = true);
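
The ''true|string'' return convention of '''is_valid_analysable''' can be sketched as follows: true when the analysable is valid, otherwise a human readable reason. The function, the array keys and the messages are hypothetical, and the rules (finished courses for training, ongoing courses for prediction) are only an example.

```php
<?php
// Hypothetical sketch of the true|string convention: true when valid,
// otherwise a human readable reason why the analysable was discarded.
function example_is_valid_analysable(array $course, int $now, bool $fortraining = true) {
    if (empty($course['startdate']) || empty($course['enddate'])) {
        return 'The course needs both a start and an end date';
    }
    if ($course['studentcount'] < 1) {
        return 'The course has no students';
    }
    if ($fortraining && $course['enddate'] > $now) {
        return 'Only finished courses can be used for training';
    }
    if (!$fortraining && $course['enddate'] <= $now) {
        return 'Only ongoing courses can receive predictions';
    }
    return true;
}
```

Because any non-empty string is truthy in PHP, callers following this convention should compare the result with ''=== true'' rather than rely on a loose boolean check.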

'''calculate_sample''' is the method that calculates the target value.

/**
* Calculates this target for the provided samples.
*
* In case there are no values to return or the provided sample is not applicable just return null.
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param int|false $starttime Limit calculations to start time
* @param int|false $endtime Limit calculations to end time
* @return float|null
*/
protected function calculate_sample($sampleid, \core_analytics\analysable $analysable, $starttime = false, $endtime = false);

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

Time-splitting methods are useful to define when the analytics API will train the predictions processor and when it will generate predictions. As explained above in [[Analytics_API#Time_splitting_methods]], they define time ranges based on analysable elements start and end timestamps.

The base class is '''\core_analytics\local\time_splitting\base'''; if what you are after is to split the analysable duration in equal parts or in cumulative parts you can extend '''\core_analytics\local\time_splitting\equal_parts''' or '''\core_analytics\local\time_splitting\accumulative_parts''' instead.

'''define_ranges''' is the main method to implement and its values mostly depend on the current analysable element (available in '''$this->analysable'''). It should return an array of time ranges, each containing 3 attributes: a start time ('start') and an end time ('end'), which are passed to indicators so they can limit the amount of activity logs they read, and 'time', whose value determines when the range will be executed.

/**
* Define the time splitting methods ranges.
*
* 'time' value defines when predictions are executed, their values will be compared with
* the current time in ready_to_predict
*
* @return array('start' => time(), 'end' => time(), 'time' => time())
*/
protected function define_ranges();
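
A minimal sketch of the range structure, assuming the analysable duration is split into equal, non-overlapping parts and that each range is executed once it is over. The standalone function below is hypothetical; a real time-splitting method would implement ''define_ranges()'' in a class and read the start and end times from '''$this->analysable'''.

```php
<?php
// Hypothetical sketch: splits [$start, $end] into equal, non-overlapping ranges.
function example_define_ranges(int $start, int $end, int $parts): array {
    if ($parts < 1 || $end <= $start) {
        return [];
    }
    $length = intdiv($end - $start, $parts);
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $start + $i * $length;
        // The last range absorbs any remainder left by the integer division.
        $rangeend = ($i === $parts - 1) ? $end : $rangestart + $length;
        $ranges[] = [
            'start' => $rangestart,
            'end' => $rangeend,
            // Assumption: the range is executed once it is over.
            'time' => $rangeend,
        ];
    }
    return $ranges;
}
```

A cumulative variant could keep 'start' fixed at the analysable start time for every range, so that each range also covers all the previous ones.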

A name and description should also be specified:

/**
* Returns a lang_string object representing the name for the time splitting method.
*
* Used as column identifier.
*
* If there is a corresponding '_help' string this will be shown as well.
*
* @return \lang_string
*/
public static function get_name() : \lang_string;

==== Calculable (core_analytics\calculable) ====

This interface is left for the end because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base'''; you can still code targets or indicators directly from '''\core_analytics\calculable''' if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success", this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework (teaching presence, social presence and cognitive presence). Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control analysable validity through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level use a "by_course" analyser, as otherwise you may have scalability problems (imagine storing in memory the calculations for each grade_grades record in your site; processing elements course by course helps because memory is cleaned after each course is processed).

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour you can overwrite in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same sample, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to settle on a single time splitting method at this stage, as it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Create the target as specified in https://docs.moodle.org/dev/Analytics_API#Target_.28core_analytics.5Clocal.5Ctarget.5Cbase.29.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using courses as analysables
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than X% of the course students have attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (a student who has not passed after X attempts counts as failed) or the student has passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and so on) and use them later to get predictions. The Moodle cron lock prevents multiple concurrent executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-12-27T06:52:17Z

Dmonllao: /* Target (core_analytics\local\target\base) */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the identification of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in finished courses and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create their own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with student submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and have never been accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.
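
The idea behind a static model, a prediction calculated directly from indicator values without any machine learning backend, can be sketched as follows. The function and the indicator names (''any_teacher'', ''any_student'') are hypothetical and only loosely inspired by the "no teaching" example; in a real static model the rule would live in the target class.

```php
<?php
// Hypothetical sketch: a static "no teaching" rule evaluated directly on
// indicator values, with no machine learning backend involved.
function example_static_prediction(array $indicatorvalues): ?int {
    // Indicator values range from -1 (no) to 1 (yes); a missing value means
    // the indicator could not be calculated.
    $anyteacher = $indicatorvalues['any_teacher'] ?? null;
    $anystudent = $indicatorvalues['any_student'] ?? null;
    if ($anyteacher === null || $anystudent === null) {
        return null;
    }
    // Assumption defined by the target: a course with students but no
    // teacher is flagged (1); anything else is not (0).
    return ($anystudent > 0 && $anyteacher < 0) ? 1 : 0;
}
```

Because the rule only consumes indicator values, the same indicators remain reusable in models that do rely on a machine learning backend.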

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. e.g. A sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but the indicator can easily be coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change according to that. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated as a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
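
The two approaches described above could be sketched as follows. This is a minimal illustration rather than core code: the function names ''scale_by_expected_max'' and ''scale_by_deviation'', the top-level constants and the default of 2 standard deviations are all assumptions made for this example.

```php
<?php
// Hypothetical helpers for illustration; they are not part of the Analytics API.

const MIN_VALUE = -1.0;
const MAX_VALUE = 1.0;

/**
 * Maps a raw event count to [-1, 1] as a proportion of an expected maximum.
 *
 * @param int $count Observed number of events (e.g. forum posts).
 * @param int $expectedmax Count at or above which activity is considered maximal.
 * @return float
 */
function scale_by_expected_max(int $count, int $expectedmax): float {
    if ($expectedmax <= 0) {
        throw new InvalidArgumentException('$expectedmax must be positive');
    }
    // Clamp the count to 0..$expectedmax, then express it as a 0..1 proportion.
    $proportion = min(max($count, 0), $expectedmax) / $expectedmax;
    return MIN_VALUE + $proportion * (MAX_VALUE - MIN_VALUE);
}

/**
 * Maps a raw count to [-1, 1] relative to the mean of the same context.
 * A count $maxdeviations standard deviations away from the mean maps to -1 or +1.
 */
function scale_by_deviation(float $count, float $mean, float $stddev, float $maxdeviations = 2.0): float {
    if ($stddev <= 0.0) {
        return 0.0; // No spread in the context: treat every sample as average.
    }
    $z = ($count - $mean) / $stddev;
    return max(MIN_VALUE, min(MAX_VALUE, $z / $maxdeviations));
}
```

With an expected maximum of 10 posts, 0 posts map to -1, 5 posts to 0 and 10 or more posts to 1. The deviation-based variant carries the pedagogical caveat noted above, because it ranks students against each other.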

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
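
To make the idea of equal-length and accumulative ranges concrete, here is a minimal sketch. It is not the core time-splitting implementation: the function name ''split_into_ranges'', its parameters and the returned array shape are assumptions made for this example.

```php
<?php
// Hypothetical helper for illustration; core time-splitting classes work differently.

/**
 * Splits the period between $start and $end into $parts ranges of equal length.
 *
 * @param int $start Course start timestamp.
 * @param int $end Course end timestamp.
 * @param int $parts Number of ranges (4 for quarters, 10 for tenths).
 * @param bool $accumulative If true, every range starts at the course start.
 * @return array[] List of ['start' => int, 'end' => int] ranges.
 */
function split_into_ranges(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($end <= $start || $parts < 1) {
        throw new InvalidArgumentException('A fixed start and end date are required');
    }
    $length = intdiv($end - $start, $parts);
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : $start + $i * $length;
        // The last range absorbs any remainder so that it always finishes at $end.
        $rangeend = ($i === $parts - 1) ? $end : $start + ($i + 1) * $length;
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

The exception thrown for a missing or inverted date range mirrors the limitation described above: without fixed start and end dates there is nothing to divide into equal segments.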

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system; the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include Machine learning backend interfaces; see https://docs.moodle.org/dev/Machine_learning_backends#Interfaces for those.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones are going to be processed next (skipping, for example, the ones processed recently). You can take core_analytics\course as an example.
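
The lazy loading described above could look roughly like this. The sketch is deliberately standalone: the class name ''lazy_analysable'' and the loader callback are assumptions made for this example, and the class does not implement the real '''\core_analytics\analysable''' interface.

```php
<?php
// Hypothetical class for illustration; it does not implement \core_analytics\analysable.

class lazy_analysable {
    /** @var int Identifier, known up front and cheap to keep in memory. */
    private $id;
    /** @var callable Loads the full record on demand. */
    private $loader;
    /** @var array|null Cached record, null until it is first needed. */
    private $record = null;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    public function get_id(): int {
        return $this->id; // Never triggers a load.
    }

    public function get_name(): string {
        return $this->load()['name'];
    }

    /**
     * Fetches the record the first time it is needed and reuses it afterwards.
     */
    private function load(): array {
        if ($this->record === null) {
            $this->record = ($this->loader)($this->id);
        }
        return $this->record;
    }
}
```

An analyser can then hold thousands of these objects while only the ones that are actually processed pay the cost of loading their data.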

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int.
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match their expectations. ''e.g. if your model is only interested in quizzes with a time close, the analyser will return all quizzes and your model will exclude the ones without a time close. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return data associated with the sample ids they provide. This is important for 2 reasons:
* The data they provide alongside the sample origin is used to filter out indicators that are not related to what this analyser analyses. ''e.g. courses analysers provide courses and information about courses, but not information about users; an '''is user profile complete''' indicator will require the user object to be available, so a model using a courses analyser will not be able to use the '''is user profile complete''' indicator.''
* The data included here is cached in PHP static vars; on one hand this reduces the number of db queries indicators need to perform. On the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
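
As a rough illustration of the two-element return value documented above, and of how the provided data determines which indicators can be used, consider the following sketch. The function names, the input rows and the keys used inside the samples data are assumptions made for this example, not the core implementation.

```php
<?php
// Hypothetical helpers for illustration; only the return shape follows the docblocks above.

/**
 * Builds the sample ids and the data associated with each sample.
 *
 * @param array[] $enrolments Rows with 'id', 'userid' and 'courseid' keys.
 * @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
 */
function build_samples(array $enrolments): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $row) {
        $id = (int) $row['id'];
        $sampleids[$id] = $id;
        // Related data is keyed by the name an indicator would ask for.
        $samplesdata[$id] = [
            'user_enrolments' => $row,
            'user' => ['id' => $row['userid']],
            'course' => ['id' => $row['courseid']],
        ];
    }
    return [$sampleids, $samplesdata];
}

/**
 * Checks whether a sample carries all the data elements an indicator requires.
 *
 * @param array $sampledata Data provided for a single sample.
 * @param string[] $required Names of the required elements.
 * @return bool
 */
function provides_required_data(array $sampledata, array $required): bool {
    return array_diff($required, array_keys($sampledata)) === [];
}
```

In this sketch an indicator requiring ''user'' and ''course'' could be used with these samples, while one requiring ''grade_grades'' could not.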

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the Moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context to a sampleid. This is important because this sample's predictions will only be available for users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in the ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators.

You can use '''required_sample_data''' to specify what your indicator needs to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. e.g. if you need a grade_grades record mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them as well, because the principle analysers follow is to include as much related data as possible. Even so, do not flag related objects as required: an analyser may, for example, choose not to include the ''user'' object because it is too big and sites can have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
    return null;
}

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during the $starttime - $endtime period) but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?). ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical as it runs once for each sample and for each range in the time-splitting method; some tips:
* To avoid performance issues or repeated db queries, analyser classes provide information about the samples that you can use for your calculations. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also overwrite the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is limited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keeping the memory usage acceptably low and prevents hard-to-trace caching bugs.
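
The calculation logic of a simple yes/no indicator, such as the profile example mentioned above, might be sketched as follows. The function is standalone and hypothetical: in a real indicator this logic would live in ''calculate_sample()'' and the user data would come from ''$this->retrieve('user', $sampleid)''. Representing the user as an array with ''picture'' and ''description'' keys is an assumption made to keep the example self-contained.

```php
<?php
// Hypothetical standalone function for illustration; not part of the Analytics API.

/**
 * Returns the indicator value for a "user profile is complete" check.
 *
 * @param array|false $user User data, or false if the analyser did not provide it.
 * @return float|null 1.0 (complete), -1.0 (incomplete) or null if it cannot be calculated.
 */
function profile_complete_value($user): ?float {
    if ($user === false) {
        return null; // No user data for this sample, so there is nothing to calculate.
    }
    $haspicture = !empty($user['picture']);
    $hasdescription = trim((string) ($user['description'] ?? '')) !== '';
    return ($haspicture && $hasdescription) ? 1.0 : -1.0;
}
```

Because the data is taken from what the analyser already provided, the calculation needs no additional database query per sample.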

==== Target (core_analytics\local\target\base) ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Even though Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', Moodle 3.4 machine learning backends only support binary classifications. So unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically targets could be reused between models, although this is not recommended; you should focus instead on having a single model with a single set of indicators that work together towards predicting accurately. The only valid use case that comes to mind for models in production is using different time-splitting methods for the same target, although, again, the proper way to solve this is by using a single time-splitting method specific to your needs.

The first thing a target must define is the analyser class that it will use. The analyser class is specified in '''get_analyser_class'''.

/**
* Returns the analyser class that should be used along with this target.
*
* @return string The full class name as a string
*/
abstract public function get_analyser_class();

'''is_valid_analysable''' and '''is_valid_sample''' are used to discard elements that are not valid for your target.

/**
* Allows the target to verify that the analysable is a good candidate.
*
* This method can be used as a quick way to discard invalid analysables.
* e.g. Imagine that your analysable don't have students and you need them.
*
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return true|string
*/
public function is_valid_analysable(\core_analytics\analysable $analysable, $fortraining = true);

/**
* Is this sample from the $analysable valid?
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param bool $fortraining
* @return bool
*/
public function is_valid_sample($sampleid, \core_analytics\analysable $analysable, $fortraining = true);
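
The ''true|string'' contract of '''is_valid_analysable''' can be illustrated with a standalone sketch. The function name ''check_course_dates'', its parameters and the specific rules (finished courses for training, ongoing courses for prediction) are assumptions made for this example.

```php
<?php
// Hypothetical standalone function mirroring the true|string contract described above.

/**
 * Decides whether a course is a good candidate for training or prediction.
 *
 * @param int|false $start Course start timestamp, false if undefined.
 * @param int|false $end Course end timestamp, false if undefined.
 * @param int $now Current timestamp.
 * @param bool $fortraining True when collecting training data.
 * @return true|string True if valid, otherwise the reason it was discarded.
 */
function check_course_dates($start, $end, int $now, bool $fortraining = true) {
    if (!$start || !$end) {
        return 'The course needs both a start and an end date';
    }
    if ($end <= $start) {
        return 'The course end date is not after its start date';
    }
    if ($fortraining && $end > $now) {
        return 'Only finished courses can be used for training';
    }
    if (!$fortraining && ($now < $start || $now > $end)) {
        return 'Only ongoing courses can receive predictions';
    }
    return true;
}
```

Callers of a function with this contract should compare the result with ''=== true'', because a non-empty string is also truthy in PHP.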

'''calculate_sample''' is the method that calculates the target value.

/**
* Calculates this target for the provided samples.
*
* In case there are no values to return or the provided sample is not applicable just return null.
*
* @param int $sampleid
* @param \core_analytics\analysable $analysable
* @param int|false $starttime Limit calculations to start time
* @param int|false $endtime Limit calculations to end time
* @return float|null
*/
protected function calculate_sample($sampleid, \core_analytics\analysable $analysable, $starttime = false, $endtime = false);
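
A binary target calculation could be sketched like this. The rule used here (a student counts as dropped out when there is no access during the final quarter of the course) and the function name ''dropout_target'' are assumptions made for this example and are not taken from the core target.

```php
<?php
// Hypothetical standalone function for illustration; not the core target implementation.

/**
 * Labels a student enrolment for a binary "dropped out" target.
 *
 * @param int $lastaccess Timestamp of the student's last course access, 0 if never.
 * @param int $coursestart Course start timestamp.
 * @param int $courseend Course end timestamp.
 * @return float|null 1.0 (dropped out), 0.0 (did not drop out) or null if not applicable.
 */
function dropout_target(int $lastaccess, int $coursestart, int $courseend): ?float {
    if ($courseend <= $coursestart) {
        return null; // The sample cannot be labelled without a valid course duration.
    }
    $finalquarterstart = $courseend - intdiv($courseend - $coursestart, 4);
    return $lastaccess < $finalquarterstart ? 1.0 : 0.0;
}
```

Returning null rather than guessing a label keeps samples that cannot be calculated out of the training data.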

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

''To be completed...''

==== Calculable (core_analytics\calculable) ====

Leaving this interface for the end because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success," this definition should probably have some basis in pedagogy. (For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework: teaching presence, social presence, and cognitive presence.) Start by being clear about how the target will be defined. It must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control analysable validity through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level use a "by_course" analyser, as otherwise you may have scalability problems (imagine storing in memory calculations for each grade_grades record in your site); processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour you can overwrite in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same sample, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to settle on a single time splitting method at this stage, and it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

As specified in https://docs.moodle.org/dev/Analytics_API#Target_.28core_analytics.5Clocal.5Ctarget.5Cbase.29.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling them. You can enable models using Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using courses as analysables
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing and course activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. This directory can be used by machine learning backends to store trained algorithms (their internal variables, such as weights) to use them later to get predictions. The Moodle cron lock will prevent multiple executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-12-27T06:45:10Z

Dmonllao:

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the detection of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each different site will continue training that site's machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with student submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. e.g. A sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change according to that. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated in a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
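The two scaling approaches described above can be sketched in plain PHP. This is only an illustration: ''scale_by_thresholds()'' and ''scale_by_deviation()'' are hypothetical helper names that are not part of the Analytics API, the first one assumes a simple linear scale up to a maximum expected count, and the second assumes that 2 standard deviations map to the range limits.

```php
<?php
// Hypothetical helpers, not part of the Moodle Analytics API.

/**
 * Scales a raw event count to [-1, 1] against a maximum expected count.
 * 0 events map to -1, $maxexpected (or more) events map to 1.
 */
function scale_by_thresholds(int $count, int $maxexpected): float {
    if ($maxexpected <= 0) {
        throw new InvalidArgumentException('The maximum expected count must be positive');
    }
    $ratio = min(max($count, 0), $maxexpected) / $maxexpected; // 0..1
    return -1.0 + 2.0 * $ratio;
}

/**
 * Scales a raw count to [-1, 1] relative to the counts of all users in the
 * same context. The mean maps to 0 and $maxdeviations standard deviations
 * above or below the mean map to 1 or -1 (assumption: 2 by default).
 */
function scale_by_deviation(float $count, array $groupcounts, float $maxdeviations = 2.0): float {
    $n = count($groupcounts);
    if ($n === 0) {
        return 0.0;
    }
    $mean = array_sum($groupcounts) / $n;
    $variance = 0.0;
    foreach ($groupcounts as $value) {
        $variance += ($value - $mean) ** 2;
    }
    $stddev = sqrt($variance / $n);
    if ($stddev == 0.0) {
        // Everybody has the same count, so there is nothing to compare.
        return 0.0;
    }
    $zscore = ($count - $mean) / $stddev;
    return max(-1.0, min(1.0, $zscore / $maxdeviations));
}
```

With a maximum of 10 expected posts, 5 posts would be scaled to 0 and anything from 10 posts upwards to 1. An indicator's ''calculate_sample()'' could return values computed this way.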

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
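As a rough illustration of how a course with fixed start and end dates can be divided into equal segments, the sketch below computes the time ranges in plain PHP. ''split_into_ranges()'' is a hypothetical helper and not the implementation used by the core time-splitting classes; the accumulative flag follows the description above, where each range starts at the beginning of the course.

```php
<?php
// Hypothetical helper, not part of the Moodle Analytics API.

/**
 * Splits the period between $start and $end into $parts ranges of equal length.
 *
 * @param int $start Course start timestamp
 * @param int $end Course end timestamp
 * @param int $parts Number of ranges (4 for quarters, 10 for tenths...)
 * @param bool $accumulative If true every range starts at $start
 * @return array[] List of arrays with 'start' and 'end' keys
 */
function split_into_ranges(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($parts < 1 || $end <= $start) {
        throw new InvalidArgumentException('A positive number of parts and a valid period are required');
    }
    $length = ($end - $start) / $parts;
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : (int) floor($start + $i * $length);
        // The last range always finishes exactly at the end of the course.
        $rangeend = ($i === $parts - 1) ? $end : (int) floor($start + ($i + 1) * $length);
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

Because the ranges are proportional to the course length, the same call works for a 4-week course and for a 6-month course, which is what allows both to be part of the same model.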

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] is a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, with core ('''lib/classes/analytics''') being the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include Machine learning backend interfaces.'' They are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones are going to be processed next (skipping the ones processed recently, for example). You can take core_analytics\course as an example.

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
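The lazy loading advice above can be sketched as follows. This is a minimal sketch that runs outside Moodle, so ''analysable_sketch'' is only a stand-in for '''\core_analytics\analysable''' and ''get_context()'' is left out because it needs Moodle's context classes. ''lazy_course_analysable'' and the ''$loader'' callback are hypothetical names; in a real plugin the callback would be a database query.

```php
<?php
// Stand-in for \core_analytics\analysable so that the sketch runs outside Moodle.
interface analysable_sketch {
    public function get_id();
    public function get_name();
    public function get_start();
    public function get_end();
}

/**
 * Hypothetical analysable that only keeps the id in memory and loads the
 * full record the first time it is actually needed.
 */
class lazy_course_analysable implements analysable_sketch {
    /** @var int */
    private $id;
    /** @var callable Receives the id and returns the record (assumption). */
    private $loader;
    /** @var stdClass|null Loaded on demand. */
    private $record = null;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    private function load() {
        if ($this->record === null) {
            $this->record = call_user_func($this->loader, $this->id);
        }
        return $this->record;
    }

    public function get_id() {
        // No need to load anything to return the id.
        return $this->id;
    }

    public function get_name() {
        return $this->load()->fullname;
    }

    public function get_start() {
        $start = $this->load()->startdate;
        return $start ? (int) $start : false;
    }

    public function get_end() {
        $end = $this->load()->enddate;
        return $end ? (int) $end : false;
    }
}
```

An analyser can then hold thousands of these objects while only the ones that end up being processed trigger a record load.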

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''e.g. if your model is only interested in quizzes with a time close, the analyser will return all quizzes and your model will exclude the ones without a time close. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return data associated with the sample ids they provide. This is important for 2 reasons:
* The data they provide alongside the sample origin is used to filter out indicators that are not related to what this analyser analyses. ''e.g. courses analysers do provide courses and information about courses, but not information about users; an '''is user profile complete''' indicator will require the user object to be available. A model using a courses analyser will not be able to use the '''is user profile complete''' indicator.''
* The data included here is cached in PHP static vars; on one hand this reduces the amount of db queries indicators need to perform. On the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
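The return value documented above is a pair of arrays indexed by sample id. A minimal sketch of how such a pair could be built is shown below; ''build_samples()'' is a hypothetical helper, the records are plain objects rather than real database rows, and keying the related data by database table name is an assumption based on how indicators declare the data they require.

```php
<?php
// Hypothetical helper, not part of the Moodle Analytics API.

/**
 * Builds the array(sampleids, samplesdata) pair for a list of enrolments.
 *
 * @param stdClass[] $enrolments Records with an 'id' property
 * @param stdClass $course The course all these enrolments belong to
 * @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
 */
function build_samples(array $enrolments, stdClass $course): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $enrolment) {
        $sampleid = (int) $enrolment->id;
        $sampleids[$sampleid] = $sampleid;
        // Related data that indicators will be able to retrieve later on.
        $samplesdata[$sampleid] = [
            'user_enrolments' => $enrolment,
            'course' => $course,
        ];
    }
    return [$sampleids, $samplesdata];
}
```

The same course object is shared by all samples, which keeps memory usage lower than copying the course data into each sample.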

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context to a sampleid. This is important because this sample's predictions will only be available for users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators.

You can use '''required_sample_data''' to specify what your indicator needs to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. e.g. if you need a grade_grades record mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them as well, because the principle analysers follow is to include as much related data as possible. Still, do not flag related objects as required, because an analyser may, for example, choose not to include the ''user'' object because it is too big and sites could have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
return null;
}

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during $starttime - $endtime period) but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?) ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical as it runs once for each sample and for each range in the time-splitting method; some tips:
* To avoid performance issues or repeated db queries analyser classes provide information about the samples that you can use for your calculations to save some database queries. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also override the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is limited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keep the memory usage acceptably low and prevents hard-to-trace caching bugs.
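To show how ''required_sample_data()'', ''calculate_sample()'' and ''retrieve()'' fit together, here is a minimal sketch of a yes/no indicator. It runs outside Moodle, so ''indicator_sketch_base'' is only a simplified stand-in for the core indicator base classes; its ''set_sample_data()'' and ''calculate()'' methods and the ''user_has_picture'' class are hypothetical and only exist to make the example self-contained.

```php
<?php
// Simplified stand-in for the core indicator base classes so that the
// sketch runs outside Moodle. Only the parts used below are reproduced.
abstract class indicator_sketch_base {
    const MIN_VALUE = -1;
    const MAX_VALUE = 1;

    /** @var array Data provided by the analyser, indexed by sample id. */
    protected $sampledata = [];

    public function set_sample_data(array $data) {
        $this->sampledata = $data;
    }

    /** Returns the requested element of a sample or false if it is not available. */
    protected function retrieve($tablename, $sampleid) {
        return $this->sampledata[$sampleid][$tablename] ?? false;
    }

    public function calculate($sampleid, $sampleorigin, $starttime = false, $endtime = false) {
        return $this->calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);
    }

    abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);
}

/**
 * Hypothetical indicator: does the user have a profile picture?
 */
class user_has_picture extends indicator_sketch_base {

    public static function required_sample_data() {
        // Only models whose analyser provides 'user' data can use this indicator.
        return ['user'];
    }

    protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime) {
        // Use the data the analyser already loaded instead of querying the database.
        $user = $this->retrieve('user', $sampleid);
        if (!$user) {
            // The indicator can not be calculated for this sample.
            return null;
        }
        // This indicator does not depend on the time range.
        return empty($user->picture) ? self::MIN_VALUE : self::MAX_VALUE;
    }
}
```

This indicator ignores ''$starttime'' and ''$endtime'' because a profile picture is not tied to a time period; an activity-based indicator would use them to restrict the logs it reads.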

==== Target (core_analytics\local\target\base) ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Even though Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', Moodle 3.4 machine learning backends only support binary classifications. So unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically targets could be reused between models, although this is not recommended; you should focus instead on having a single model with a single set of indicators that work together towards predicting accurately. The only valid use case for models in production would be to use the same target with different time-splitting methods although, again, the proper way to solve this is by using a single time-splitting method specific to your needs.

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

''To be completed...''

==== Calculable (core_analytics\calculable) ====

Leaving this interface for the end because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success," this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework (teaching presence, social presence, and cognitive presence). Start by being clear about how the target will be defined. It must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level use a "by_course" analyser as otherwise you may have scalability problems (imagine storing in memory calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed)
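The kind of check a target's ''is_valid_analysable()'' method might perform can be sketched as follows. ''course_is_valid_for()'' is a hypothetical standalone function that works on timestamps only, and it assumes the common rule of training on finished courses and predicting on ongoing ones; a real target receives the analysable object and can apply any other criteria.

```php
<?php
// Hypothetical helper, not part of the Moodle Analytics API.

/**
 * Decides whether a course period is usable for training or for prediction.
 *
 * @param int|false $start Course start timestamp, false if there is none
 * @param int|false $end Course end timestamp, false if there is none
 * @param int $now Current timestamp
 * @param bool $fortraining True when collecting training data
 * @return bool
 */
function course_is_valid_for($start, $end, int $now, bool $fortraining): bool {
    if (!$start || !$end || $end <= $start) {
        // Time splitting needs both a start and an end date.
        return false;
    }
    if ($fortraining) {
        // Only finished courses have a known outcome to learn from.
        return $end < $now;
    }
    // Predictions are only useful while the course is ongoing.
    return $start <= $now && $now <= $end;
}
```

Keeping the training and prediction rules in one place makes it harder to train on courses whose outcome is not known yet.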

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour that you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to commit to a single time splitting method at this stage, and it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

As specified in https://docs.moodle.org/dev/Analytics_API#Target_.28core_analytics.5Clocal.5Ctarget.5Cbase.29.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling them. You can enable models using Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on user activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, such as weights) so they can be used later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-12-13T10:43:16Z

Dmonllao: /* Analyser (core_analytics\local\analyser\base) */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing that we can easily calculate. These '''static model''' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.
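
As a rough illustration of how a static model derives its prediction directly from indicator values, consider the following self-contained sketch. The function name and the two indicator names are made up for this example and are not part of Moodle core; a real static model would express this rule inside a target class.

```php
<?php
/**
 * Hypothetical static rule: flag a course unless both binary indicators are positive.
 *
 * @param array $indicators Calculated values in the -1..1 range, indexed by
 *                          indicator name (the names are invented for this sketch).
 * @return int 1 = the course needs attention, 0 = it does not.
 */
function no_teaching_rule(array $indicators): int {
    // A missing indicator is treated as the minimum value (an assumption).
    $hasteacher = ($indicators['any_teacher_enrolled'] ?? -1) > 0;
    $hasaccess = ($indicators['teacher_accessed_course'] ?? -1) > 0;
    return ($hasteacher && $hasaccess) ? 0 : 1;
}
```

No training data is involved: the assumption "a course without an active teacher needs attention" is coded once, while the indicators remain reusable by other models.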

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change according to that. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated in a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
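
The two approaches described above can be sketched as follows. This is only an illustration: the function names, treating fewer than 3 posts as the minimum value, and clipping at 2 standard deviations are assumptions made for this example, not part of the Analytics API.

```php
<?php
// Hypothetical helpers, not part of Moodle core.

/**
 * Threshold approach: map a raw count to one of four fixed steps,
 * using the example thresholds from the text (3, 6 and 10 posts).
 */
function count_to_steps(int $count): float {
    if ($count >= 10) {
        return 1.0;   // Maximum expected activity.
    }
    if ($count >= 6) {
        return 0.33;  // Adequate activity.
    }
    if ($count >= 3) {
        return -0.33; // Some activity.
    }
    return -1.0;      // Below the minimum desired activity (assumed).
}

/**
 * Comparative approach: distance from the mean of all users in the same
 * context, measured in standard deviations and clipped to the -1..1 range.
 */
function count_to_zscore(int $count, array $allcounts, float $maxdeviations = 2.0): float {
    $n = count($allcounts);
    if ($n === 0) {
        return 0.0;
    }
    $mean = array_sum($allcounts) / $n;
    $variance = 0.0;
    foreach ($allcounts as $c) {
        $variance += ($c - $mean) ** 2;
    }
    $stddev = sqrt($variance / $n);
    if ($stddev == 0.0) {
        return 0.0; // Everyone has the same count.
    }
    $z = ($count - $mean) / $stddev;
    return max(-1.0, min(1.0, $z / $maxdeviations));
}
```

Both functions always return a value inside the allowed range, whatever the raw count is.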

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
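
A minimal sketch of the idea of proportional time ranges follows. The function name and the ''start''/''end'' array keys are assumptions for this example; the time splitting classes in core are more elaborate.

```php
<?php
/**
 * Hypothetical helper: splits the period between $start and $end into
 * $parts ranges of equal length.
 *
 * @param int $start Course start timestamp.
 * @param int $end Course end timestamp.
 * @param int $parts Number of ranges (4 for quarters, 10 for tenths...).
 * @param bool $accumulative If true, every range begins at $start.
 * @return array[] List of ['start' => int, 'end' => int] ranges.
 */
function split_into_ranges(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($parts < 1 || $end <= $start) {
        // Without fixed start and end dates there is nothing to split.
        return [];
    }
    $length = ($end - $start) / $parts;
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : (int) floor($start + $i * $length);
        // The last range always finishes exactly at $end.
        $rangeend = ($i === $parts - 1) ? $end : (int) floor($start + ($i + 1) * $length);
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

Because the ranges are proportional to the course duration, a 4-week course and a 6-month course both produce the same number of ranges, which is what allows them to share a model.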

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] is a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components. Core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include Machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to decide which ones are going to be processed next (for example, skipping the ones processed recently). You can take core_analytics\course as an example.

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int.
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
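
The lazy loading advice above can be sketched as follows. To keep the example runnable on its own it declares a stand-in interface with a subset of the methods; in a real plugin the class would implement '''\core_analytics\analysable''' and read from the Moodle database. The class name, the loader callback and the record fields are assumptions made for this example.

```php
<?php
// Stand-in for \core_analytics\analysable so the sketch runs outside Moodle.
interface analysable_like {
    public function get_id();
    public function get_name();
    public function get_start();
    public function get_end();
}

// Hypothetical analysable that only stores its id until more data is needed.
class lazy_activity implements analysable_like {
    /** @var int */
    private $id;
    /** @var callable Loads the full record; in Moodle this would be a database query. */
    private $loader;
    /** @var array|null Full record, fetched on first use. */
    private $record = null;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    private function load(): array {
        if ($this->record === null) {
            $this->record = ($this->loader)($this->id);
        }
        return $this->record;
    }

    public function get_id() {
        return $this->id; // No load required.
    }

    public function get_name() {
        return $this->load()['name'];
    }

    public function get_start() {
        return $this->load()['timeopen'] ?? false;
    }

    public function get_end() {
        return $this->load()['timeclose'] ?? false;
    }
}
```

An analyser can build thousands of these objects cheaply, because the record is only fetched for the analysables that are actually processed.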

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''e.g. if your model is only interested in quizzes with a time close, the analyser will return all quizzes and your model will exclude the ones without a time close. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return data associated with the sample ids they provide. This is important for 2 reasons:
* The data they provide alongside the sample origin is used to filter out indicators that are not related to what this analyser analyses. ''e.g. courses analysers do provide courses and information about courses, but not information about users; an '''is user profile complete''' indicator will require the user object to be available. A model using a courses analyser will not be able to use the '''is user profile complete''' indicator.''
* The data included here is cached in PHP static vars; on one hand this reduces the amount of db queries indicators need to perform. On the other hand, if not well balanced, it can lead to PHP memory issues.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
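
To make the shape of the returned value more concrete, here is a self-contained sketch for an analyser whose samples are user enrolments. The helper name and the plain arrays used as input are assumptions made for this example; a real analyser would fetch this data from the database.

```php
<?php
/**
 * Hypothetical helper showing the shape of the returned value.
 *
 * @param array $enrolments Rows with 'id' and 'userid' keys (an assumption of this sketch).
 * @param array $course The course record all these enrolments belong to.
 * @param array $users User records indexed by user id.
 * @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
 */
function build_enrolment_samples(array $enrolments, array $course, array $users): array {
    $sampleids = [];
    $samplesdata = [];
    foreach ($enrolments as $enrolment) {
        if (!isset($users[$enrolment['userid']])) {
            continue; // Discard invalid data: enrolment without a user record.
        }
        $id = (int) $enrolment['id'];
        $sampleids[$id] = $id;
        // Related data is indexed by the name of the table it comes from, so
        // indicators can later ask for 'course' or 'user' by name.
        $samplesdata[$id] = [
            'user_enrolments' => $enrolment,
            'course' => $course,
            'user' => $users[$enrolment['userid']],
        ];
    }
    return [$sampleids, $samplesdata];
}
```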

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context to a sampleid. This is important because this sample's predictions will only be available for users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in the ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators.

You can use '''required_sample_data''' to specify what your indicator needs to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. For example, if you need a grade_grades record mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them as well because the principle analysers follow is to include as much related data as possible. Still, do not flag related objects as required because an analyser may, for example, choose not to include the ''user'' object because it is too big and sites can have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
return null;
}
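
The compatibility check described above can be pictured with the following sketch. The function is hypothetical and only illustrates the rule; it is not the code the Analytics API uses internally.

```php
<?php
/**
 * Hypothetical check: can an indicator be used with a given analyser?
 *
 * @param string[]|null $required Value returned by required_sample_data().
 * @param string[] $provided Table names the analyser attaches to each sample.
 * @return bool
 */
function indicator_is_usable(?array $required, array $provided): bool {
    if ($required === null) {
        return true; // The indicator does not require any specific data.
    }
    // Every required element must be among the provided ones.
    return array_diff($required, $provided) === [];
}
```

For instance, an indicator requiring ''user'' would be rejected for an analyser that only provides ''course'' and ''context'', which is why requiring less makes an indicator usable in more models.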

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during $starttime - $endtime period) but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?) ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical as it runs once for each sample and for each range in the time-splitting method; some tips:
* To avoid performance issues or repeated db queries analyser classes provide information about the samples that you can use for your calculations to save some database queries. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also override the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is not unlimited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keeping the memory usage acceptably low and prevents hard-to-trace caching bugs.
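
As an illustration of ''calculate_sample'' and ''retrieve()'' working together, here is a self-contained sketch. The abstract class is a stand-in that mimics only the pieces used below (its constructor and ''calculate_one()'' are inventions of this example), and the indicator itself is hypothetical; a real indicator would extend one of the core indicator classes instead.

```php
<?php
// Stand-in for the indicator base class so the sketch runs outside Moodle.
abstract class indicator_like {
    const MIN_VALUE = -1;
    const MAX_VALUE = 1;

    /** @var array Data attached by the analyser: [sampleid => [tablename => record]]. */
    private $sampledata;

    public function __construct(array $sampledata) {
        $this->sampledata = $sampledata;
    }

    /** Returns the requested record, or false if the analyser did not provide it. */
    protected function retrieve($tablename, $sampleid) {
        return $this->sampledata[$sampleid][$tablename] ?? false;
    }

    abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

    /** Public entry point used only by this sketch. */
    public function calculate_one($sampleid, $sampleorigin, $starttime = 0, $endtime = 0) {
        return $this->calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);
    }
}

// Hypothetical indicator: does the user have a picture and a profile description?
class profile_filled_example extends indicator_like {
    public static function required_sample_data() {
        return ['user'];
    }

    protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime) {
        // Use the data the analyser already loaded instead of querying the database.
        $user = $this->retrieve('user', $sampleid);
        if ($user === false) {
            return null; // The indicator can not be calculated for this sample.
        }
        // No time restriction applies: $starttime and $endtime are not relevant here.
        $filled = !empty($user['picture']) && !empty($user['description']);
        return $filled ? self::MAX_VALUE : self::MIN_VALUE;
    }
}
```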

==== Target (core_analytics\local\target\base) ====

''To be completed...''

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

''To be completed...''

==== Calculable (core_analytics\calculable) ====

This interface is described last because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''', but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success," this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework: teaching presence, social presence, and cognitive presence. Be clear about how the target will be defined. It must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level use a "by_course" analyser, as otherwise you may have scalability problems (imagine storing in memory calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed)

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, users with the '''moodle/analytics:listinsights''' capability at analysable level are notified of new insights (though this is only a default behaviour that you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to commit to a single time splitting method at this stage; it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', Moodle 3.4 machine learning backends only support binary classification, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically targets could be reused between models, although this is not recommended; you should focus instead on having a single model with a single set of indicators that work together towards predicting accurately. The only valid use case I can think of for models in production is using different time-splitting methods with the same target although, again, the proper way to solve this is by using a single time-splitting method specific to your needs.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site; it would also work using the course as analysable)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model
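The validity rules of the ''Spam users'' example can be sketched as plain PHP. This is only an illustration under stated assumptions: ''spam_sample_is_valid()'' and the two constants are hypothetical names, not part of the Analytics API, and a real target would apply the same logic inside its ''is_valid_sample()'' method.

```php
<?php
// Hypothetical helper, not part of the Analytics API. It mirrors the
// "Spam users" example: predictions start 4 hours after the account was
// created, training waits until the full 2 days have passed.

const SKETCH_HOURSECS = 3600;
const SKETCH_DAYSECS = 86400;

/**
 * Decides whether a user sample is old enough to be analysed.
 *
 * @param int  $timecreated Unix timestamp of the account creation.
 * @param int  $now         Current Unix timestamp.
 * @param bool $fortraining True when collecting training data.
 * @return bool
 */
function spam_sample_is_valid(int $timecreated, int $now, bool $fortraining): bool {
    $age = $now - $timecreated;
    if ($fortraining) {
        // Training needs the whole 2-day window so the spammer flag is known.
        return $age >= 2 * SKETCH_DAYSECS;
    }
    // Prediction can start once the first range (4 hours) has elapsed.
    return $age >= 4 * SKETCH_HOURSECS;
}
```

A five-hour-old account would therefore be valid for prediction but not yet for training.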

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than X% of the course students have attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, such as weights) and use them later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-12-13T10:35:55Z

Dmonllao: /* Indicator (core_analytics\local\indicator\base) */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even though the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages the evaluation logs of prediction models.

The '''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.
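
To make the idea of a static model concrete, the following minimal sketch computes a prediction directly from an indicator value, with no machine learning backend involved. The function name, the ''has_teacher'' indicator name and the rule itself are assumptions made for illustration only; they are not the code of the built-in ''No teaching'' model.

```php
<?php
/**
 * Hypothetical static rule in the spirit of "No teaching".
 *
 * Assumption: the 'has_teacher' indicator returns -1 when no teacher is
 * assigned to the course and 1 when at least one teacher is assigned.
 *
 * @param float[] $indicators Indicator name => calculated value in [-1, 1].
 * @return int 1 if the course is flagged as having no teaching, 0 otherwise.
 */
function static_no_teaching_prediction(array $indicators): int {
    if (!array_key_exists('has_teacher', $indicators)) {
        throw new InvalidArgumentException('Missing indicator value: has_teacher');
    }
    // The "prediction" is just a fixed assumption applied to the indicator.
    return $indicators['has_teacher'] < 0 ? 1 : 0;
}
```

Because the rule only reads indicator values, the same indicator could still be reused by a model that does rely on a machine learning backend.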

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size, it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.
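
The per-course approach can be pictured with the following self-contained sketch. It is not the core implementation: ''analyse_by_course()'' and its parameters are hypothetical, and the row format (comma-separated numbers) is an assumption. It only shows the pattern of writing one partial dataset per course and merging the partial files at the end, so that no more than one course's calculations are held in memory at a time.

```php
<?php
/**
 * Hypothetical sketch: analyse courses one by one and merge the results.
 *
 * @param int[]    $courseids  Courses to analyse.
 * @param callable $calculate  function(int $courseid): array[] of numeric rows.
 * @param string   $outputfile Path of the merged dataset file.
 * @return int Number of rows written to the merged dataset.
 */
function analyse_by_course(array $courseids, callable $calculate, string $outputfile): int {
    $partials = [];
    foreach ($courseids as $courseid) {
        // One partial file per course; its rows can be freed once written.
        $partial = tempnam(sys_get_temp_dir(), 'course');
        $fh = fopen($partial, 'w');
        foreach ($calculate($courseid) as $row) {
            fwrite($fh, implode(',', $row) . "\n");
        }
        fclose($fh);
        $partials[] = $partial;
    }

    // Merge step: stream each partial file into the final dataset.
    $rows = 0;
    $out = fopen($outputfile, 'w');
    foreach ($partials as $partial) {
        $in = fopen($partial, 'r');
        while (($line = fgets($in)) !== false) {
            fwrite($out, $line);
            $rows++;
        }
        fclose($in);
        unlink($partial);
    }
    fclose($out);
    return $rows;
}
```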

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be expressed as a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
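
Both approaches can be sketched in a few lines of PHP. The function names are hypothetical and nothing here is core code. In the first function the thresholds (3, 6 and 10 posts) come from the example above, while treating 1 or 2 posts the same as 0 posts is an assumption. In the second function, the choice of 2 standard deviations as the point where the value saturates is also an assumption.

```php
<?php
/**
 * Threshold approach: map a raw number of posts to a value in [-1, 1].
 * Assumption: fewer than 3 posts is treated like no activity.
 */
function posts_to_indicator_value(int $posts): float {
    if ($posts >= 10) {
        return 1.0;   // Maximum expected activity.
    }
    if ($posts >= 6) {
        return 0.33;  // Adequate activity.
    }
    if ($posts >= 3) {
        return -0.33; // Some activity.
    }
    return -1.0;      // No (or almost no) activity.
}

/**
 * Statistical approach: compare a value to the mean of its context.
 * 0 = at the mean; -1 / +1 = $maxdeviations or more below / above the mean.
 */
function zscore_to_indicator_value(float $value, float $mean, float $stddev,
        float $maxdeviations = 2.0): float {
    if ($stddev <= 0.0) {
        return 0.0; // No spread in the context: everyone is at the mean.
    }
    $z = ($value - $mean) / $stddev;
    // Scale so that $maxdeviations maps to 1, then clamp to [-1, 1].
    return max(-1.0, min(1.0, $z / $maxdeviations));
}
```

The clamping in the second function is what keeps outliers from producing values outside the range the API expects.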

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
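
As an illustration of proportional splitting, the sketch below divides the period between a start and an end timestamp into equal parts, optionally making each range accumulative. ''split_into_ranges()'' is a hypothetical helper written for this page, not one of the core time-splitting classes; it rejects analysables without a usable start and end, which is the limitation described above.

```php
<?php
/**
 * Hypothetical helper: split [$start, $end] into $parts equal ranges.
 *
 * @param int  $start        Start timestamp of the analysable.
 * @param int  $end          End timestamp of the analysable.
 * @param int  $parts        Number of ranges (4 = quarters, 10 = tenths...).
 * @param bool $accumulative If true, every range begins at $start.
 * @return array[] List of ['start' => int, 'end' => int] ranges.
 */
function split_into_ranges(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($parts < 1 || $end <= $start) {
        throw new InvalidArgumentException('Needs at least 1 part and start < end');
    }
    $length = ($end - $start) / $parts;
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : (int) floor($start + $i * $length);
        // The last range always finishes exactly at the analysable end.
        $rangeend = ($i === $parts - 1) ? $end : (int) floor($start + ($i + 1) * $length);
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

A weekly method for self-paced courses would instead use a fixed range length and would not need the end date to compute the boundaries.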

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] is a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components. Core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include Machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation.

It is also important to mention that analysable elements should be lazy loaded, otherwise you may have PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones are going to be processed next (skipping the ones processed recently, for example). You can take core_analytics\course as an example.
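
A minimal sketch of the lazy loading idea follows. ''lazy_course_sketch'' is a hypothetical stand-alone class (it does not implement '''\core_analytics\analysable''' and is not how core_analytics\course is written); the ''$loader'' callable stands in for whatever database query would fetch the full record. The point is only that creating the object is cheap and the expensive data is fetched the first time it is needed.

```php
<?php
/**
 * Hypothetical analysable-like class that loads its data on demand.
 */
class lazy_course_sketch {
    /** @var int Identifier, the only thing stored up front. */
    private $id;
    /** @var callable function(int $id): array, fetches the full record. */
    private $loader;
    /** @var array|null Full record, null until it is first needed. */
    private $record = null;
    /** @var int How many times the loader ran (for illustration only). */
    public $loads = 0;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    public function get_id(): int {
        return $this->id; // Cheap: no data is loaded.
    }

    public function get_name(): string {
        return $this->load()['fullname'];
    }

    private function load(): array {
        if ($this->record === null) {
            $this->record = ($this->loader)($this->id);
            $this->loads++;
        }
        return $this->record;
    }
}
```

With this pattern an analyser can hold thousands of such objects to decide what to process next, while only the ones actually analysed ever load their data.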

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''For example, if your model is only interested in quizzes with a time close, the analyser will return all quizzes and your model will exclude the ones without a time close. This approach is supposed to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return the same sample data for a given sample id, so you may be interested in having a shared method that they can both call.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
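
The shared-method idea can be sketched as follows. ''users_analyser_sketch'' is a hypothetical stand-alone class backed by an in-memory array instead of the database. For brevity its ''get_all_samples()'' is public and takes no analysable, unlike the abstract method above, and keying the sample ids array by id is an assumption of this sketch.

```php
<?php
/**
 * Hypothetical analyser-like class: both public methods delegate to one helper.
 */
class users_analyser_sketch {
    /** @var array[] Fake "user" table, keyed by user id. */
    private $users;

    public function __construct(array $users) {
        $this->users = $users;
    }

    /** All samples of the (single, site-wide) analysable. */
    public function get_all_samples(): array {
        return $this->build_samples(array_keys($this->users));
    }

    /** Samples for the given ids only. */
    public function get_samples(array $sampleids): array {
        return $this->build_samples($sampleids);
    }

    /**
     * Shared helper, so both methods return identical data for the same id.
     *
     * @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
     */
    private function build_samples(array $ids): array {
        $sampleids = [];
        $samplesdata = [];
        foreach ($ids as $id) {
            if (!isset($this->users[$id])) {
                continue; // Unknown ids are skipped.
            }
            $sampleids[$id] = $id;
            $samplesdata[$id] = ['user' => $this->users[$id]];
        }
        return [$sampleids, $samplesdata];
    }
}
```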

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the Moodle database table that uses the sample id as primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context with a sample id. This is important because the predictions for this sample will only be available to users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in the ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

==== Indicator (core_analytics\local\indicator\base) ====

Indicators should generally extend one of these 3 classes, depending on the values they can return: ''core_analytics\local\indicator\binary'' for '''yes/no''' indicators, ''core_analytics\local\indicator\linear'' for indicators that return linear values and ''core_analytics\local\indicator\discrete'' for categorised indicators.

You can use '''required_sample_data''' to specify what your indicator needs to be calculated; you may need a ''user'' object, a ''course'', a ''grade item''... The default implementation does not require anything. Models whose analysers do not return the required data will not be able to use your indicator, so only list here what you really need. For example, if you need a grade_grades record, mark it as required, but there is no need to require the ''user'' object and the ''course'' as well because you can obtain them from the grade_grades item. It is very likely that the analyser will provide them anyway, because the principle analysers follow is to include as much related data as possible. Still, do not flag related objects as required, because an analyser may, for example, choose not to include the ''user'' object because it is too big and sites can have memory problems.

/**
* Allows indicators to specify data they need.
*
* e.g. A model using courses as samples will not provide users data, but an indicator like
* "user is hungry" needs user data.
*
* @return null|string[] Name of the required elements (use the database tablename)
*/
public static function required_sample_data() {
    return null;
}
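
As a rough illustration of how this requirement restricts which models can use an indicator, the sketch below compares what an indicator requires with what an analyser provides. Only ''required_sample_data()'' comes from the text above; the class ''grade_based_indicator_sketch'', the helper ''indicator_is_usable()'' and the list of provided elements are hypothetical and are not part of the Analytics API.

```php
<?php
// Hypothetical indicator (not a Moodle class): it only needs a grade_grades record.
class grade_based_indicator_sketch {
    /**
     * @return null|string[] Name of the required elements (database table names)
     */
    public static function required_sample_data() {
        return array('grade_grades');
    }
}

/**
 * Hypothetical helper (not a Moodle function).
 *
 * @param null|string[] $required What the indicator requires
 * @param string[] $provided What the analyser provides for each sample
 * @return bool True if every required element is provided
 */
function indicator_is_usable($required, array $provided) {
    if ($required === null) {
        // The indicator does not require anything.
        return true;
    }
    return count(array_diff($required, $provided)) === 0;
}

// Assumed list of elements provided by an imaginary analyser.
$provided = array('course', 'context', 'user', 'grade_grades');

var_dump(indicator_is_usable(grade_based_indicator_sketch::required_sample_data(), $provided));
var_dump(indicator_is_usable(array('quiz_attempts'), $provided));
```

The first call reports that the indicator is usable because ''grade_grades'' is provided; the second one does not, because the imaginary analyser provides no ''quiz_attempts'' data.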

A single method must be implemented, '''calculate_sample'''. Most indicators make use of $starttime and $endtime to restrict the time period they consider for their calculations (e.g. read actions during the $starttime - $endtime period), but some indicators may not need to apply any restriction (e.g. does this user have a user picture and profile description?). ''self::MIN_VALUE'' is -1 and ''self::MAX_VALUE'' is 1. We do not recommend changing these values.

/**
* Calculates the sample.
*
* Return a value from self::MIN_VALUE to self::MAX_VALUE or null if the indicator can not be calculated for this sample.
*
* @param int $sampleid
* @param string $sampleorigin
* @param integer $starttime Limit the calculation to this timestart
* @param integer $endtime Limit the calculation to this timeend
* @return float|null
*/
abstract protected function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime);

Note that performance here is critical, as this method runs once for each sample and for each range in the time-splitting method. Some tips:
* To avoid performance issues and repeated database queries, analyser classes provide information about the samples that you can use in your calculations. You can retrieve information about a sample with '''$user = $this->retrieve('user', $sampleid)'''. ''retrieve()'' will return false if the requested data is not available.
* You can also overwrite the ''fill_per_analysable_caches'' method if necessary (keep in mind though that PHP memory is limited).
* Indicator instances are reset for each analysable and time range that is processed. This helps keep memory usage acceptably low and prevents hard-to-trace caching bugs.
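
The shape of a ''calculate_sample'' implementation can be sketched as follows. This is a standalone illustration, not code from Moodle core: the class ''any_action_in_range_sketch'' does not extend the real base classes, the method is made public so that it can be called directly, and the in-memory ''$events'' array is a hypothetical stand-in for data that a real indicator would obtain through ''retrieve()'' or database queries.

```php
<?php
// Standalone sketch of a binary indicator. It mimics the calculate_sample()
// signature documented above but does not depend on any Moodle class.
class any_action_in_range_sketch {
    const MIN_VALUE = -1;
    const MAX_VALUE = 1;

    /** @var array[] Hypothetical event timestamps, keyed by sample id. */
    protected $events;

    public function __construct(array $events) {
        $this->events = $events;
    }

    /**
     * Returns MAX_VALUE if the sample has any event between $starttime and
     * $endtime, MIN_VALUE if it has none, or null if nothing is known about it.
     *
     * @param int $sampleid
     * @param string $sampleorigin
     * @param int $starttime
     * @param int $endtime
     * @return int|null
     */
    public function calculate_sample($sampleid, $sampleorigin, $starttime, $endtime) {
        if (!isset($this->events[$sampleid])) {
            // The indicator can not be calculated for this sample.
            return null;
        }
        foreach ($this->events[$sampleid] as $timestamp) {
            if ($timestamp >= $starttime && $timestamp <= $endtime) {
                return self::MAX_VALUE;
            }
        }
        return self::MIN_VALUE;
    }
}

// Sample 10 has two events, at timestamps 1500 and 2500.
$indicator = new any_action_in_range_sketch(array(10 => array(1500, 2500)));
$inrange = $indicator->calculate_sample(10, 'user_enrolments', 1000, 2000);
$outofrange = $indicator->calculate_sample(10, 'user_enrolments', 3000, 4000);
$unknown = $indicator->calculate_sample(99, 'user_enrolments', 1000, 2000);
```

The same sample yields different values for different time ranges, which is why the method is called once per sample and per range.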

==== Target (core_analytics\local\target\base) ====

''To be completed...''

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

''To be completed...''

==== Calculable (core_analytics\calculable) ====

This interface is left for the end because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''', but you can still code targets or indicators directly on top of '''\core_analytics\calculable''' if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success," this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework: teaching presence, social presence and cognitive presence. Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own, and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control analysable validity through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems (imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed).

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour you can overwrite in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to require a single time splitting method at this stage, and they can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets can be reused between models, although this is not recommended; you should focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The only valid use case I can think of for models in production is using different time-splitting methods with the same target although, again, the proper way to solve this is by using a single time-splitting method specific to your needs.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may want to evaluate how good a model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing and course activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The ''analytics/outputdir'' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and so on) and use them later to get predictions. The Moodle cron lock will prevent multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-12-13T10:06:15Z

Dmonllao: /* Design */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even though the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size, it could take hours to analyse the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change according to that. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses.''

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be expressed as a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
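
The two approaches described above can be sketched as plain PHP functions. This is only an illustration under stated assumptions: the function names ''scale_count_to_range()'' and ''scale_by_deviation()'' are hypothetical and are not part of the Analytics API, the first function uses a simple linear mapping rather than the stepped thresholds mentioned in the text, and the second one assumes that 2 standard deviations map to -1 or +1.

```php
<?php
/**
 * Linear scaling against an expected maximum: 0 events give -1,
 * $maxexpected or more events give 1.
 *
 * @param int $count Raw number of events
 * @param int $maxexpected Number of events considered the maximum expected activity (> 0)
 * @return float Value between -1 and 1
 */
function scale_count_to_range($count, $maxexpected) {
    // Clamp the count to [0, $maxexpected] and express it as a 0..1 ratio.
    $ratio = min(max($count, 0), $maxexpected) / $maxexpected;
    return (float)(-1 + 2 * $ratio);
}

/**
 * Comparative scaling: distance from the mean of all counts, measured in
 * standard deviations and clamped so that $maxdeviations maps to -1 or +1.
 *
 * @param int $count Raw number of events for this sample
 * @param int[] $allcounts Raw number of events for all samples in the context
 * @param int $maxdeviations Standard deviations that map to the range limits
 * @return float|null Value between -1 and 1, null if there is nothing to compare with
 */
function scale_by_deviation($count, array $allcounts, $maxdeviations = 2) {
    $n = count($allcounts);
    if ($n === 0) {
        return null;
    }
    $mean = array_sum($allcounts) / $n;
    $squares = 0;
    foreach ($allcounts as $value) {
        $squares += ($value - $mean) ** 2;
    }
    $stddev = sqrt($squares / $n);
    if ($stddev == 0) {
        // Everybody has the same count, so nobody deviates from the mean.
        return 0.0;
    }
    $deviations = ($count - $mean) / $stddev;
    return (float)max(-1, min(1, $deviations / $maxdeviations));
}
```

With an expected maximum of 10 posts, ''scale_count_to_range(5, 10)'' lands in the middle of the range, and any count of 10 or more is capped at the maximum value.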

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.
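
As a minimal sketch of the difference between plain and accumulative ranges, the function below splits a period into equal parts. It is not the implementation used by the time splitting methods in Moodle core; the function name ''split_into_ranges()'' and the array keys are hypothetical, and boundary handling (e.g. whether consecutive ranges share a timestamp) is simplified.

```php
<?php
/**
 * Splits the period between $start and $end into $parts ranges of equal length.
 *
 * @param int $start Start timestamp of the analysable
 * @param int $end End timestamp of the analysable
 * @param int $parts Number of ranges (4 for quarters, 10 for tenths...)
 * @param bool $accumulative If true, every range begins at $start
 * @return array[] List of arrays with 'start' and 'end' keys
 */
function split_into_ranges($start, $end, $parts, $accumulative = false) {
    $length = ($end - $start) / $parts;
    $ranges = array();
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : (int)floor($start + $i * $length);
        $rangeend = (int)floor($start + ($i + 1) * $length);
        $ranges[] = array('start' => $rangestart, 'end' => $rangeend);
    }
    return $ranges;
}

// A course that lasts 400 seconds, just to keep the numbers readable.
$quarters = split_into_ranges(0, 400, 4);
$quartersaccum = split_into_ranges(0, 400, 4, true);
```

Here the second entry of ''$quarters'' covers 100 to 200, while the second entry of ''$quartersaccum'' covers 0 to 200, so accumulative ranges always include the activity from the beginning of the course.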

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, with core ('''lib/classes/analytics''') being the component that hosts general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the Universities' student information system; the site admin could build a new model on top of the built-in 'students at risk of drop out detection' adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include Machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is short and largely self-explanatory.

It is also important that analysable elements are lazy loaded, otherwise you may run into PHP memory issues. The reason is that analysers load all analysable elements in the site to work out which ones should be processed next (for example, skipping the ones that were processed recently). You can take core_analytics\course as an example.

Methods to implement:

/**
* The analysable unique identifier in the site.
*
* @return int
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

'''get_start''' and '''get_end''' define the start and end times that indicators will use for their calculations.

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
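
The lazy loading advice above can be sketched as follows. This is only an illustration under stated assumptions: ''analysable_sketch'' is a local stand-in that mirrors some of the methods listed above so the example runs outside Moodle (a real class would implement '''\core_analytics\analysable''', including ''get_context()''), and the loader callback and record field names are hypothetical.

```php
<?php
// Local stand-in so the sketch runs outside Moodle (assumption, not the real interface).
interface analysable_sketch {
    public function get_id();
    public function get_name();
    public function get_start();
    public function get_end();
}

class lazy_course implements analysable_sketch {
    /** @var int The course id, the only data held up front. */
    private $id;
    /** @var callable Hypothetical loader that fetches the full record on demand. */
    private $loader;
    /** @var array|null Cached record, null until it is first needed. */
    private $record = null;

    public function __construct(int $id, callable $loader) {
        $this->id = $id;
        $this->loader = $loader;
    }

    /** Loads the record the first time one of its fields is requested. */
    private function record(): array {
        if ($this->record === null) {
            $this->record = ($this->loader)($this->id);
        }
        return $this->record;
    }

    public function get_id() {
        return $this->id; // Available without loading anything.
    }

    public function get_name() {
        return $this->record()['fullname'];
    }

    public function get_start() {
        return $this->record()['startdate'] ?: false;
    }

    public function get_end() {
        return $this->record()['enddate'] ?: false;
    }
}

// Usage: the loader counts how many times the record is actually fetched.
$loads = 0;
$loader = function (int $id) use (&$loads): array {
    $loads++;
    return ['fullname' => "Course $id", 'startdate' => 1500000000, 'enddate' => 0];
};
$course = new lazy_course(42, $loader);
```

Building a list of thousands of such objects only stores ids; the full record is fetched once, and only for the analysables that are actually processed.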

==== Analyser (core_analytics\local\analyser\base) ====

'''get_analysables''' returns the whole list of analysable elements in the site. Each model will later be able to discard analysables that do not match its expectations. ''For example, if your model is only interested in quizzes with a close time, the analyser returns all quizzes and your model excludes the ones without a close time. This approach is meant to make analysers more reusable.''

/**
* Returns the list of analysable elements available on the site.
*
* @return \core_analytics\analysable[] Array of analysable elements using the analysable id as array key.
*/
abstract public function get_analysables();

'''get_all_samples''' and '''get_samples''' should return the same samples data associated with the sample ids so you may be interested in having a method that they can both call.

/**
* This function returns this analysable list of samples.
*
* @param \core_analytics\analysable $analysable
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract protected function get_all_samples(\core_analytics\analysable $analysable);

/**
* This function returns the samples data from a list of sample ids.
*
* @param int[] $sampleids
* @return array array[0] = int[] (sampleids) and array[1] = array (samplesdata)
*/
abstract public function get_samples($sampleids);
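
One way to keep both methods consistent is a shared private helper, sketched below. This is a stand-alone illustration, not core code: the class name, the in-memory rows and the ''user_enrolments'' key are assumptions, and ''get_all_samples'' takes a plain course id here, whereas the real method is protected and receives a '''\core_analytics\analysable'''.

```php
<?php
// Stand-alone sketch; a real analyser extends \core_analytics\local\analyser\base
// and reads its rows from the Moodle database.
class enrolments_analyser_sketch {
    /** @var array[] Hypothetical in-memory rows standing in for database records. */
    private $rows;

    public function __construct(array $rows) {
        $this->rows = $rows;
    }

    /** All samples of one analysable (here: all enrolments of a course id). */
    public function get_all_samples(int $courseid): array {
        $rows = array_filter($this->rows, function (array $row) use ($courseid) {
            return $row['courseid'] === $courseid;
        });
        return $this->format_samples($rows);
    }

    /** Samples for the given list of sample ids. */
    public function get_samples(array $sampleids): array {
        $rows = array_filter($this->rows, function (array $row) use ($sampleids) {
            return in_array($row['id'], $sampleids, true);
        });
        return $this->format_samples($rows);
    }

    /** Shared helper, so both methods return identical data for a sample id. */
    private function format_samples(array $rows): array {
        $sampleids = [];
        $samplesdata = [];
        foreach ($rows as $row) {
            $sampleids[$row['id']] = $row['id'];
            $samplesdata[$row['id']] = ['user_enrolments' => $row];
        }
        return [$sampleids, $samplesdata];
    }
}

$analyser = new enrolments_analyser_sketch([
    ['id' => 1, 'courseid' => 10, 'userid' => 5],
    ['id' => 2, 'courseid' => 10, 'userid' => 6],
    ['id' => 3, 'courseid' => 11, 'userid' => 5],
]);
list($allids, $alldata) = $analyser->get_all_samples(10);
list($ids, $data) = $analyser->get_samples([2]);
```

Both calls return the documented shape: index 0 holds the sample ids and index 1 the samples data keyed by sample id.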

The '''get_sample_analysable''' method is executed during prediction:

/**
* Returns the analysable of a sample.
*
* @param int $sampleid
* @return \core_analytics\analysable
*/
abstract public function get_sample_analysable($sampleid);

The sample origin is the Moodle database table that uses the sample id as its primary key.

/**
* Returns the sample's origin in moodle database.
*
* @return string
*/
abstract public function get_samples_origin();

'''sample_access_context''' associates a context with a sample id. This is important because the predictions for this sample will only be available to users with the ''moodle/analytics:listinsights'' capability in that context.

/**
* Returns the context of a sample.
*
* @param int $sampleid
* @return \context
*/
abstract public function sample_access_context($sampleid);

'''sample_description''' is used to display samples in ''Insights'' report:

/**
* Describes a sample with a description summary and a \renderable (an image for example)
*
* @param int $sampleid
* @param int $contextid
* @param array $sampledata
* @return array array(string, \renderable)
*/
abstract public function sample_description($sampleid, $contextid, $sampledata);

==== Indicator (core_analytics\local\indicator\base) ====

''To be completed...''

==== Target (core_analytics\local\target\base) ====

''To be completed...''

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

''To be completed...''

==== Calculable (core_analytics\calculable) ====

Leaving this interface for the end because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success", this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework (teaching presence, social presence, and cognitive presence). Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser; otherwise you may have scalability problems (imagine storing in memory the calculations for each grade_grades record in your site). Processing elements course by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour that you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to settle on a single time splitting method at this stage, and it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets can be reused between models, although this is not recommended; focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The only valid use case I can think of for models in production is using different time-splitting methods for the same target although, again, the proper way to solve this is to use a single time-splitting method specific to your needs.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling them. You can enable models using Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as the analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing and course activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because if after X attempts has not passed counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The ''analytics/outputdir'' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, such as weights) so that they can be used later to get predictions. The Moodle cron lock prevents concurrent executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-12-13T09:44:13Z

Dmonllao: /* Design */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even though the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. Some events that stakeholders may want to know about are obvious and easy to calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with student submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a ''user enrolment'' sample id would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions,'' because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated in a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
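
The two normalisation approaches described above can be sketched as plain PHP functions. Both helpers are hypothetical and not part of the Analytics API; the thresholds (3, 6 and 10 posts) come from the example in the text, treating fewer than 3 posts as the minimum is an assumption, and so is the default of 2 standard deviations as the cut-off.

```php
<?php
/**
 * Threshold approach: maps a raw number of posts to one of four fixed values.
 * Assumption: fewer than 3 posts counts as the minimum (-1).
 */
function posts_to_indicator_value(int $posts): float {
    if ($posts >= 10) {
        return 1.0;      // Maximum expected activity.
    }
    if ($posts >= 6) {
        return 1 / 3;    // Adequate activity.
    }
    if ($posts >= 3) {
        return -1 / 3;   // Some activity.
    }
    return -1.0;         // Little or no activity.
}

/**
 * Comparative approach: distance from the mean in standard deviations,
 * scaled so that $maxdeviations maps to 1 and clamped to [-1, 1].
 */
function zscore_to_indicator_value(float $count, float $mean, float $stddev,
        float $maxdeviations = 2.0): float {
    if ($stddev <= 0.0) {
        return 0.0; // No spread: every user has the same count.
    }
    $scaled = ($count - $mean) / ($stddev * $maxdeviations);
    return max(-1.0, min(1.0, $scaled));
}
```

Either way the indicator never returns a raw count, so values from courses or forums of very different sizes stay comparable.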

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
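
As a rough illustration of proportional splitting, the sketch below divides a period into equal parts and optionally makes each range accumulative. It is plain PHP and does not use the Moodle time-splitting classes; the function name, parameters and returned array shape are assumptions made for this example.

```php
<?php
/**
 * Splits [$start, $end] into $parts ranges of equal length.
 * Hypothetical helper, independent of the Moodle time-splitting classes.
 *
 * @param int $start Start timestamp.
 * @param int $end End timestamp.
 * @param int $parts Number of ranges (4 for quarters, 10 for tenths).
 * @param bool $accumulative If true, every range starts at $start.
 * @return array[] List of ['start' => int, 'end' => int].
 */
function split_into_ranges(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($parts < 1 || $end <= $start) {
        throw new InvalidArgumentException('At least one part and start < end are required.');
    }
    $length = ($end - $start) / $parts;
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : (int) floor($start + $i * $length);
        // The last range always ends exactly at $end to avoid rounding gaps.
        $rangeend = ($i === $parts - 1) ? $end : (int) floor($start + ($i + 1) * $length);
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

Because the ranges are derived from the start and end dates, a 4-week course and a 12-week course both produce the same number of ranges, which is what allows them to share a model.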

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

[https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends] is a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') is the component that hosts the general purpose uses of the API.

=== Interfaces ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors. The Analytics API will be able to find them as long as they follow the namespace conventions described below.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of drop out detection' model, adding the SIS indicators to improve the model accuracy or for research purposes.

''Note that this section does not include the machine learning backend interfaces; they are available in https://docs.moodle.org/dev/Machine_learning_backends#Interfaces.''

==== Analysable (core_analytics\analysable) ====

Analysables are those elements in Moodle that contain samples (read related comments above in [[#Analyser]]). In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity. Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''.

The list of methods that need to be implemented is quite simple and does not require much explanation. get_start() and get_end() are probably the only methods worth commenting about; they define the start and end times that indicators will use for their calculations:

/**
* The analysable unique identifier in the site.
*
* @return int
*/
public function get_id();

/**
* The analysable human readable name
*
* @return string
*/
public function get_name();

/**
* The analysable context.
*
* @return \context
*/
public function get_context();

/**
* The start of the analysable if there is one.
*
* @return int|false
*/
public function get_start();

/**
* The end of the analysable if there is one.
*
* @return int|false
*/
public function get_end();
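
To make the ''int|false'' return values more concrete, here is a minimal stand-alone sketch of a class exposing the same method names. It is hypothetical and deliberately does not implement the real '''\core_analytics\analysable''' interface, so it can run outside Moodle; ''get_context()'' is left out because it needs a real context object. It assumes that a date of 0 means "not set".

```php
<?php
/**
 * Hypothetical analysable-like class for a course with optional dates.
 * It mirrors the methods listed above without depending on Moodle.
 */
class example_course_analysable {
    private $id;
    private $name;
    private $start;
    private $end;

    public function __construct(int $id, string $name, int $start = 0, int $end = 0) {
        $this->id = $id;
        $this->name = $name;
        $this->start = $start;
        $this->end = $end;
    }

    public function get_id(): int {
        return $this->id;
    }

    public function get_name(): string {
        return $this->name;
    }

    /** @return int|false Start timestamp, or false if there is none. */
    public function get_start() {
        // Assumption for this sketch: 0 means the date is not set.
        return $this->start > 0 ? $this->start : false;
    }

    /** @return int|false End timestamp, or false if there is none. */
    public function get_end() {
        return $this->end > 0 ? $this->end : false;
    }
}
```

Code that consumes an analysable should therefore check for ''false'' before using the start or end time in calculations.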

==== Analyser (core_analytics\local\analyser\base) ====

''To be completed...''

==== Indicator (core_analytics\local\indicator\base) ====

''To be completed...''

==== Target (core_analytics\local\target\base) ====

''To be completed...''

==== Time-splitting method (core_analytics\local\time_splitting\base) ====

''To be completed...''

==== Calculable (core_analytics\calculable) ====

This interface is described last because it is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base'''. You can still code targets or indicators directly from '''\core_analytics\calculable''' if you need more control.

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success," this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework (teaching presence, social presence, and cognitive presence). Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own, and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control analysable validity through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser; otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to settle on a single time splitting method at this stage, as it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets can be reused between models, but this is not recommended; focus instead on having a single model with a single set of indicators that work together towards an accurate prediction. The only valid use case for models in production would be using different time-splitting methods with the same target although, again, the proper way to solve this is to use a single time-splitting method specific to your needs.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may want to evaluate how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as the analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and similar data) so they can be used later to get predictions. The Moodle cron lock will prevent multiple executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Machine learning backends

2017-12-13T09:28:30Z

Dmonllao: /* Interfaces */

Machine learning backends

2017-12-13T09:22:08Z

Dmonllao: /* Backends included in Moodle core */

Analytics API

2017-12-13T08:35:39Z

Dmonllao: /* Machine learning backends */

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is identifying [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by third parties to create their own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

The '''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....
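
A static prediction of this kind could be sketched as below. This is plain PHP, not the actual implementation of any core target; the function name and the indicator names used in the usage note are hypothetical. The assumption it encodes is simply "flag the sample when any indicator is at its minimum value (-1)".

```php
<?php
/**
 * Static prediction calculated directly from indicator values.
 * Hypothetical function; it does not involve any machine learning backend.
 *
 * @param float[] $indicatorvalues Indicator name => value in [-1, 1].
 * @return int 1 if the sample should be flagged, 0 otherwise.
 */
function static_prediction(array $indicatorvalues): int {
    foreach ($indicatorvalues as $value) {
        if ($value <= -1.0) {
            return 1; // At least one indicator is at its minimum.
        }
    }
    return 0; // No indicator at its minimum (or no indicators at all).
}
```

For instance, ''static_prediction(['has_teacher' => -1.0, 'has_student' => 1.0])'' would flag a course that has students but no teacher.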

Moodle could already generate notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a ''user enrolment'' sample id would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change according to that. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions'', because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be expressed as a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
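
The comparative approach (distance from the mean in standard deviations) could be sketched like this. It is plain PHP with no Moodle dependencies; the function name and the ''$maxdeviations'' parameter are hypothetical, and the choice of the population standard deviation and of 2 deviations as the limit are assumptions made for the example.

```php
<?php
/**
 * Scales a count to [-1, 1] relative to the counts of all users in a context.
 * Hypothetical helper: $maxdeviations is the number of standard deviations
 * from the mean that maps to -1 or +1.
 *
 * @param int $count The user's own count.
 * @param int[] $allcounts Counts of all users in the same context.
 * @param float $maxdeviations Deviations mapped to the range limits.
 * @return float Value between -1 and 1.
 */
function relative_indicator_value(int $count, array $allcounts, float $maxdeviations = 2.0): float {
    $n = count($allcounts);
    if ($n === 0) {
        return 0.0; // Nothing to compare against.
    }
    $mean = array_sum($allcounts) / $n;
    $squares = 0.0;
    foreach ($allcounts as $c) {
        $squares += ($c - $mean) ** 2;
    }
    $stddev = sqrt($squares / $n); // Population standard deviation.
    if ($stddev == 0.0) {
        return 0.0; // Every user has the same count.
    }
    $zscore = ($count - $mean) / $stddev;
    // Clamp to the range the indicator value must stay within.
    return max(-1.0, min(1.0, $zscore / $maxdeviations));
}
```

With counts of 2, 4, 4, 4, 5, 5, 7 and 9 posts (mean 5, standard deviation 2), a student with 7 posts would get 0.5 and a student with 9 or more posts would get 1.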

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
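
For courses without a fixed end date, fixed-length ranges could be generated along these lines. This is only a sketch in plain PHP; the function and constant names are hypothetical and it is not one of the time-splitting methods shipped with Moodle 3.4. Only completed weeks are returned, on the assumption that predictions should be based on full ranges.

```php
<?php
// Hypothetical constant: number of seconds in a week (7 * 24 * 60 * 60).
const EXAMPLE_WEEK_IN_SECONDS = 604800;

/**
 * Builds week-long ranges from $start, stopping before $now is exceeded.
 * Hypothetical helper; incomplete weeks are not included.
 *
 * @param int $start Start timestamp (e.g. the enrolment time).
 * @param int $now Current timestamp.
 * @return array[] List of ['start' => int, 'end' => int].
 */
function weekly_ranges(int $start, int $now): array {
    $ranges = [];
    for ($s = $start; $s + EXAMPLE_WEEK_IN_SECONDS <= $now; $s += EXAMPLE_WEEK_IN_SECONDS) {
        $ranges[] = ['start' => $s, 'end' => $s + EXAMPLE_WEEK_IN_SECONDS];
    }
    return $ranges;
}
```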

=== Machine learning backends ===

Documentation available in [https://docs.moodle.org/dev/Machine_learning_backends Machine learning backends].

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') is the component that hosts general-purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in "students at risk of dropping out" model, adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) are able to add and/or redefine any of the entities involved in the data modelling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''
* '''\core_analytics\analysable'''

=== Interfaces ===

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. The two main types are classifiers and regressors. We provide the ''Regressor'' interface but it is not currently implemented by core machine learning backends. Both of these are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is linear (continuous), such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome value is known (e.g. past courses in which the students' final grades were recorded). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.

==== Analysable ====

Analysable items are those elements in Moodle that contain samples. In most of the cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables, '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read the related comments above in [[#Analyser]].

==== Calculable ====

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success", this definition should probably have some basis in pedagogy. For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry (CoI) theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework: teaching presence, social presence and cognitive presence. Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own and control the validity of samples through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser; otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour that you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same sample, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to commit to a single time splitting method at this stage; it can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', Moodle 3.4 machine learning backends only support binary classifications, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets can be reused between models, but this is not recommended; focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The only valid use case for reusing a target in production models is applying different time-splitting methods to it although, again, the proper solution is a single time-splitting method specific to your needs.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

<code php>
// Instantiate the target: classify users as spammers.
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer.
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');
</code>

Models are disabled by default because you may want to evaluate how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

<code php>
$model->enable();
</code>

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as the analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The ''analytics/outputdir'' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and similar state) so that they can be used later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Machine learning backends

2017-12-13T08:34:49Z

Dmonllao: Created page with " == Introduction == Machine learning backends process the datasets generated from the indicators and targets calculated by the Analytics API. They are used for machine learni..."

Plugin types

2017-12-13T08:19:30Z

Dmonllao: /* List of Moodle plugin types */

{{Plugins development}}

The M in Moodle stands for modular. The easiest and most maintainable way to add new functionality to Moodle is by writing one of these types of plugin.

== List of Moodle plugin types ==

{| class="nicetable"
|-
! Plugin type
! Component name ([[Frankenstyle]])
! Moodle path
! Description
! Moodle versions
|-
| [[Activity modules]]
| mod
| /mod
| Activity modules are essential types of plugins in Moodle as they provide activities in courses. For example: Forum, Quiz and Assignment.
| 1.0+
|-
| [[Antivirus plugins]]
| antivirus
| /lib/antivirus
| Antivirus scanner plugins provide functionality for virus scanning user uploaded files using third-party virus scanning tools in Moodle. For example: ClamAV.
| 3.1+
|-
| [[Assign_submission_plugins|Assignment submission plugins]]
| assignsubmission
| /mod/assign/submission
| Different forms of assignment submissions
| 2.3+
|-
| [[Assign_feedback_plugins|Assignment feedback plugins]]
| assignfeedback
| /mod/assign/feedback
| Different forms of assignment feedback
| 2.3+
|-
| [[Book tools]]
| booktool
| /mod/book/tool
| Tools that extend the Book activity module, for example to print, import or export book content
| 2.1+
|-
| [[Database fields]]
| datafield
| /mod/data/field
| Different types of data that may be added to the Database activity module
| 1.6+
|-
| [[Database presets]]
| datapreset
| /mod/data/preset
| Pre-defined templates for the Database activity module
| 1.6+
|-
| [[External tool source|LTI sources]]
| ltisource
| /mod/lti/source
| LTI providers can be added to external tools easily through the external tools interface; see [https://docs.moodle.org/en/External_tool Documentation on External Tools]. This type of plugin is specific to LTI providers that need a plugin that can register custom handlers to process LTI messages
| 2.7+
|-
| [[File Converters]]
| fileconverter
| /files/converter
| Allow conversion between different types of user-submitted file. For example from .doc to PDF.
| 3.2+
|-
| [[LTI services]]
| ltiservice
| /mod/lti/service
| Allows the implementation of LTI services as described by the IMS LTI specification
| 2.8+
|-
| [[Machine learning backends]]
| mlbackend
| /lib/mlbackend
| Prediction processors for analytics API
| 3.4+
|-
| [[Quiz reports]]
| quiz
| /mod/quiz/report
| Display and analyse the results of quizzes, or just plug miscellaneous behaviour into the quiz module
| 1.1+
|-
| [[Quiz access rules]]
| quizaccess
| /mod/quiz/accessrule
| Add conditions to when or where quizzes can be attempted, for example only from some IP addresses, or student must enter a password first
| 2.2+
|-
| [[SCORM reports]]
| scormreport
| /mod/scorm/report
| Analysis of SCORM attempts
| 2.2+
|-
| [[Workshop grading strategies]]
| workshopform
| /mod/workshop/form
| Define the type of the grading form and implement the calculation of the grade for submission in the [[Workshop]] module
| 2.0+
|-
| [[Workshop allocation methods]]
| workshopallocation
| /mod/workshop/allocation
| Define how submissions are allocated for assessment in the [[Workshop]] module
| 2.0+
|-
| [[Workshop evaluation methods]]
| workshopeval
| /mod/workshop/eval
| Implement the calculation of the grade for assessment (grading grade) in the [[Workshop]] module
| 2.0+
|-
| [[Blocks]]
| block
| /blocks
| Small information-displays or tools that can be moved around pages
| 2.0+
|-
| [[Question types]]
| qtype
| /question/type
| Different types of question (e.g. multiple-choice, drag-and-drop) that can be used in quizzes and other activities
| 1.6+
|-
| [[Question behaviours]]
| qbehaviour
| /question/behaviour
| Control how students interact with questions during an attempt
| 2.1+
|-
| [[Question formats|Question import/export formats]]
| qformat
| /question/format
| Import and export question definitions to/from the question bank
| 1.6+
|-
| [[Filters|Text filters]]
| filter
| /filter
| Automatically convert, highlight, and transmogrify text posted into Moodle.
| 1.4+
|-
| [[Editors]]
| editor
| /lib/editor
| Alternative text editors for editing content
| 2.0+
|-
| [[Atto|Atto editor plugins]]
| atto
| /lib/editor/atto/plugins
| Extra functionality for the Atto text editor
| 2.7+
|-
| [[TinyMCE editor plugins]]
| tinymce
| /lib/editor/tinymce/plugins
| Extra functionality for the TinyMCE text editor.
| 2.4+
|-
| [[Enrolment plugins]]
| enrol
| /enrol
| Ways to control who is enrolled in courses
| 2.0+
|-
| [[Authentication plugins]]
| auth
| /auth
| Allows connection to external sources of authentication
| 2.0+
|-
| [[Admin tools]]
| tool
| /admin/tool
| Provides utility scripts useful for various site administration and maintenance tasks
| 2.2+
|-
| [[Log stores]]
| logstore
| /admin/tool/log/store
| Event logs storage back-ends
| 2.7+
|-
| [[Availability conditions]]
| availability
| /availability/condition
| Conditions to restrict user access to activities and sections.
| 2.7+
|-
| [[Calendar types]]
| calendartype
| /calendar/type
| Defines how dates are displayed throughout Moodle
| 2.6+
|-
| [[Messaging consumers]]
| message
| /message/output
| Represent various targets where messages and notifications can be sent to (email, sms, jabber, ...)
| 2.0+
|-
| [[Course formats]]
| format
| /course/format
| Different ways of laying out the activities and blocks in a course
| 1.3+
|-
| [[Data formats]]
| dataformat
| /dataformat
| Formats for data exporting and downloading
| 3.1+
|-
| [[User profile fields]]
| profilefield
| /user/profile/field
| Add new types of data to user profiles
| 1.9+
|-
| [[Reports]]
| report
| /report
| Provides useful views of data in a Moodle site for admins and teachers
| 2.2+
|-
| [[Course reports]]
| coursereport
| /course/report
| Reports of activity within the course
| Up to 2.1 (for 2.2+ see [[Reports]])
|-
| [[Gradebook export]]
| gradeexport
| /grade/export
| Export grades in various formats
| 1.9+
|-
| [[Gradebook import]]
| gradeimport
| /grade/import
| Import grades in various formats
| 1.9+
|-
| [[Gradebook reports]]
| gradereport
| /grade/report
| Display/edit grades in various layouts and reports
| 1.9+
|-
| [[Grading methods|Advanced grading methods]]
| gradingform
| /grade/grading/form
| Interfaces for actually performing grading in activity modules (eg Rubrics)
| 2.2+
|-
| [[MNet services]]
| mnetservice
| /mnet/service
| Allows the implementation of remote services for the [[MNet]] environment (deprecated, use web services instead)
| 2.0+
|-
| [[Webservice protocols]]
| webservice
| /webservice
| Define new protocols for web service communication (such as SOAP, XML-RPC, JSON, REST ...)
| 2.0+
|-
| [[Repository plugins]]
| repository
| /repository
| Connect to external sources of files to use in Moodle
| 2.0+
|-
| [[Portfolio plugins]]
| portfolio
| /portfolio
| Connect external portfolio services as destinations for users to store Moodle content
| 1.9+
|-
| [[Search engines]]
| search
| /search/engine
| Search engine backends to index Moodle's contents.
| 3.1+
|-
| [[Media players]]
| media
| /media/player
| Pluggable media players
| 3.2+
|-
| [[Plagiarism plugins]]
| plagiarism
| /plagiarism
| Define external services to process submitted files and content
| 2.0+
|-
| [[Cache store]]
| cachestore
| /cache/stores
| Cache storage back-ends.
| 2.4+
|-
| [[Cache locks]]
| cachelock
| /cache/locks
| Cache lock implementations.
| 2.4+
|-
| [[Themes]]
| theme
| /theme
| Change the look of Moodle by changing the HTML and the CSS.
| 2.0+
|-
| [[Local plugins]]
| local
| /local
| Generic plugins for local customisations
| 2.0+
|-
| [[Assignment types|Legacy assignment types]]
| assignment
| /mod/assignment/type
| Different forms of assignments to be graded by teachers
| 1.x - 2.2
|-
| [[Admin reports|Legacy admin reports]]
| report
| /admin/report
| Provides useful views of data in a Moodle site, for admins only.
| Up to 2.1 (for 2.2+ see [[Reports]])
|}

== Obtaining the list of plugin types known to your Moodle ==

To get the exact list of plugin types in your version of Moodle, use the following script. Save it to a file in the root directory of your Moodle installation and run it from the command line.

<code php>
<?php
define('CLI_SCRIPT', true);
require('config.php');

$pluginman = core_plugin_manager::instance();

// Print one line per plugin type: type, human-readable plural name and path relative to dirroot.
foreach ($pluginman->get_plugin_types() as $type => $dir) {
    $dir = substr($dir, strlen($CFG->dirroot));
    printf("%-20s %-50s %s".PHP_EOL, $type, $pluginman->plugintype_name_plural($type), $dir);
}
</code>

==Things you can find in all plugins==

Although there are many different types of plugin, there are some things that work the same way in all plugin types, and we have [[Things that work the same in all plugin types|a page that describes them]].

Additionally you probably want to look at the page [[Plugin files]].

== Naming conventions ==

Take care when choosing a plugin (directory) name. The name is validated by the method <tt>lib/classes/component.php::is_valid_plugin_name()</tt> with a regexp: <tt>/^[a-z](?:[a-z0-9_](?!__))*[a-z0-9]+$/</tt>. In particular, the minus (-) character is not considered valid, and the plugin will be silently ignored if the name is not valid.

There is an exception for [[Activity modules|activity modules]], which cannot have an underscore in their name for legacy reasons.
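
To check a candidate name before creating the directory, the regexp quoted above can be tried in a standalone script. This is only a sketch: the constant and function names are made up for this example, it does not call Moodle's own validation method, and it does not apply the additional activity module restriction.

<code php>
<?php
// The pattern quoted in the text above.
const PLUGIN_NAME_PATTERN = '/^[a-z](?:[a-z0-9_](?!__))*[a-z0-9]+$/';

/**
 * Hypothetical helper: returns true if $name matches the quoted pattern.
 */
function looks_like_valid_plugin_name(string $name): bool {
    return preg_match(PLUGIN_NAME_PATTERN, $name) === 1;
}

foreach (['myplugin', 'my_plugin', 'my-plugin', 'Myplugin', '2fast', 'plugin_'] as $candidate) {
    printf("%-12s %s" . PHP_EOL, $candidate, looks_like_valid_plugin_name($candidate) ? 'valid' : 'invalid');
}
</code>

Of these candidates, only ''myplugin'' and ''my_plugin'' match; the others contain a minus character, an uppercase letter, a leading digit or a trailing underscore.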

== See also ==

* [[Guidelines_for_contributed_code|Guidelines for contributing code]]
* [[Core APIs]]
* [[Frankenstyle]]
* [http://moodle.org/plugins Moodle Plugins directory]
* [[Tutorial]] to help you learn how to write plugins for Moodle from start to finish, while showing you how to navigate the most important developer documentation along the way.

[[Category:Coding guidelines|Plugins]]
[[Category:Plugins]]

Analytics API

2017-11-14T14:56:57Z

Dmonllao:

== Summary ==

The Moodle Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is the event we want to predict. The indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is high enough, Moodle internally trains a machine learning algorithm by using calculations based on the defined indicators within the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting the probability that the target event will occur. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is prevention of [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out students at risk of dropping out]: Lack of participation or bad grades in previous activities could be indicators, and the target would be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows the main components of the analytics API and the interactions between them.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data a Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relationships. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core will only include models that have been proven to be good at predicting in a wide range of sites and courses. Moodle 3.4 provides two built-in models:

* [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out]
* [https://docs.moodle.org/34/en/Analytics#No_teaching No teaching]

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with will obviously be better at predicting on the sites of participating institutions, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact [[user:emdalton1|Elizabeth Dalton]] at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even though the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

The following definitions are included for people not familiar with machine learning concepts:

=== Training ===

This is the process to be run on a Moodle site before being able to predict anything. This process records the relationships found in site data from the past so the analytics system can predict what is likely to happen under the same circumstances in the future. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for, and where in the Moodle data to look. A sample is a set of calculations we make using a collection of Moodle site data. These samples are unrelated to testing data or phpunit data, and they are identified by an id matching the data element on which the calculations are based. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc. and the calculations the sample contains depend on that element. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on Quiz attempts will help develop the potential insights that the analytics might offer that are related to the Quiz attempts by a particular group of students. See [[Analytics_API#Analyser]] for more information on how to use analyser classes to define what is a sample.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. System models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relationship between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all of a model's related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages the evaluation logs of prediction models.

The '''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make accurate predictions. There are obvious events that different stakeholders may be interested in knowing about and that we can easily calculate. These ''static model'' predictions are directly calculated based on indicator values. They are based on the assumptions defined in the target, but they should still be based on indicators so all these indicators can still be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.
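
A static prediction can be pictured as a fixed rule applied to indicator values instead of the output of a trained algorithm. The standalone sketch below illustrates this; the function name, the indicator keys and the rule itself are hypothetical and do not reproduce any core target.

<code php>
<?php
/**
 * Hypothetical static rule, not part of the Analytics API: predict "no teaching"
 * directly from indicator values in the -1 (minimum) to 1 (maximum) range.
 *
 * @param float[] $indicatorvalues Values keyed by a made-up indicator name.
 * @return int 1 if the course is predicted to have no teaching, 0 otherwise.
 */
function predict_no_teaching(array $indicatorvalues): int {
    // Assumption: -1 means "no teachers assigned" or "no students enrolled".
    // A missing indicator is treated as the minimum value.
    $noteachers = ($indicatorvalues['has_teachers'] ?? -1.0) <= -1.0;
    $nostudents = ($indicatorvalues['has_students'] ?? -1.0) <= -1.0;

    // The rule is fixed in code; nothing is learned from past data.
    return ($noteachers || $nostudents) ? 1 : 0;
}
</code>

Because the rule only reads indicator values, the same indicators remain reusable in models that do rely on a machine learning backend.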

Some examples of possible static models:
* [https://docs.moodle.org/en/Analytics#No_teaching Courses without teaching activity]
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

Moodle could already generate notifications for the examples above, but there are some benefits on doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related actions.
* The Analytics API tracks user actions after viewing the predictions, so we can know if insights result in actions, which insights are not useful, etc. User responses to insights could themselves be defined as an indicator.
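As a rough illustration of how a static model can derive its prediction directly from indicator values, the sketch below flags a course as having no teaching. It is not Moodle code: the function name, the indicator keys ''has_teacher'' and ''has_students'', and the convention that -1 means "none" are all assumptions made for this example.

```php
<?php
/**
 * Hypothetical static rule, not part of Moodle core.
 *
 * @param float[] $indicators Indicator values in the [-1, 1] range, keyed by
 *                            made-up indicator names.
 * @return int 1 when the "no teaching" insight applies, 0 otherwise.
 */
function predict_no_teaching(array $indicators): int {
    // Assumed convention: -1 means "none", 1 means "at least one".
    // A missing indicator is treated as "none".
    $noteacher = ($indicators['has_teacher'] ?? -1.0) <= -1.0;
    $nostudents = ($indicators['has_students'] ?? -1.0) <= -1.0;

    // No training data is involved: the assumption is coded in the rule itself.
    return ($noteacher || $nostudents) ? 1 : 0;
}
```

Because the rule only reads indicator values, the same indicators remain reusable in models that do rely on a machine learning backend.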

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers that you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()''. This method is what defines the sample unique identifier across the site. Analyser classes are also responsible for including all site data related to that sample id; this data will be used when indicators are calculated. For example, a sample id ''user enrolment'' would include data about the ''course'', the course ''context'' and the ''user''. Samples are nothing by themselves, just a list of ids with related data. They are used in calculations once they are combined with the target and the indicator classes.

Other analyser class responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. Depending on the site's size, it could take hours to complete the analysis of the entire site, so this is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses), '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site) or create your own analyser for activities, categories or any other Moodle entity.
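The per-course processing and final merge described above can be pictured with the following minimal sketch. It does not reproduce the real analyser classes: the function name, the callback and the use of temporary CSV files are assumptions chosen to keep the example self-contained.

```php
<?php
/**
 * Hypothetical sketch: analyse one course at a time and merge the per-course
 * datasets into a single file once every course has been processed.
 *
 * @param int[] $courseids Courses to analyse.
 * @param callable $analysecourse function(int $courseid): array of dataset rows.
 * @param string $mergedpath Destination of the merged dataset.
 * @return int Number of rows written to the merged dataset.
 */
function analyse_site_by_course(array $courseids, callable $analysecourse, string $mergedpath): int {
    $partials = [];
    foreach ($courseids as $courseid) {
        // Only the rows of a single course are held in memory at any time.
        $rows = $analysecourse($courseid);
        $partial = tempnam(sys_get_temp_dir(), 'course');
        $out = fopen($partial, 'w');
        foreach ($rows as $row) {
            fputcsv($out, $row, ',', '"', '\\');
        }
        fclose($out);
        $partials[] = $partial;
        unset($rows);
    }

    // Merge step: concatenate the per-course files into the final dataset.
    $merged = fopen($mergedpath, 'w');
    $total = 0;
    foreach ($partials as $partial) {
        foreach (file($partial, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
            fwrite($merged, $line . "\n");
            $total++;
        }
        unlink($partial);
    }
    fclose($merged);
    return $total;
}
```

Splitting the work this way means a failure or timeout affects a single course rather than the whole site analysis.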

=== Target ===

Targets are the key element that defines the model. As a PHP class, targets represent the event the model is attempting to predict (the [https://en.wikipedia.org/wiki/Dependent_and_independent_variables dependent variable in supervised learning]). They also define the actions to perform depending on the received predictions.

Targets depend on analysers, because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. Here are a few examples to clarify the difference between analysers, samples and targets:

* '''Target''': 'students at risk of dropping out'. '''Analyser provides sample''': 'course enrolments'
* '''Target''': 'spammer'. '''Analyser provides sample''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides sample''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides sample''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classifications will be initially fully supported. See MDL-59044 and MDL-60523 for more information.

Although there is no technical restriction against using core targets in your own models, in most cases each model will implement a new target. One possible case in which targets might be reused would be to create a new model using the same target and a different set of indicators, for A/B testing.

==== Insights ====

Another aspect controlled by targets is insight generation. Insights represent predictions made about a specific element of the sample within the context of the analyser model. This context will be used to notify users with '''moodle/analytics:listinsights''' capability (the teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''[https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out]'' the actions can be things like sending a message to the student, viewing the student's course activity report, etc.

=== Indicator ===

Indicator PHP classes are responsible for calculating indicators (predictor value or [https://en.wikipedia.org/wiki/Dependent_and_independent_variables independent variable in supervised learning]) using the provided sample. Moodle core includes a set of indicators that can be used in your models without additional PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are. This makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and an ''enrolment'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly. For example, ''User posts in any forum'' could be used in a user-based model like ''Inactive users'' and in any other model where the analyser provides ''user'' data; ''Posts in any of the course forums'' could be used in a course-based model like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This requirement prevents the creation of "raw number" indicators like ''absolute number of write actions'', because we must limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity. Raw counts of an event like "posts to a forum" must be calculated as a proportion of an expected number of posts. There are several ways of doing this. One is to define a minimum desired number of events, e.g. 3 posts in a forum represents "some" activity, 6 posts represents adequate activity, and 10 or more posts represents the maximum expected activity. Another way is to compare the number of events per individual user to the mean or median value of events by all users in the same context, using statistical values. For example, a value of 0 would represent that the student posted the same number of posts as the mean of all student posts in that context; a value of -1 would indicate that the student is 2 or 3 standard deviations below the mean, and a +1 would indicate that the student is 2 or 3 standard deviations above the mean. ''(Note that this kind of comparative calculation has implications in pedagogy: it suggests that there is a ranking of students from best to worst, rather than a defined standard all students can reach.)''
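The two approaches above could be sketched as follows. These helpers are hypothetical and not part of Moodle core: the function names, the linear interpolation between thresholds and the default of 2 standard deviations for the extremes are assumptions made for illustration only.

```php
<?php
/**
 * Threshold-based mapping of a raw count into the [-1, 1] range.
 *
 * @param int $count Raw number of events (e.g. forum posts).
 * @param float[] $thresholds Indicator values keyed by raw count, sorted by
 *                            ascending count, e.g. [0 => -1, 3 => -0.33, ...].
 */
function indicator_from_thresholds(int $count, array $thresholds): float {
    $prevcount = null;
    $prevvalue = null;
    foreach ($thresholds as $limit => $value) {
        if ($count <= $limit) {
            if ($prevcount === null) {
                return (float) $value; // At or below the lowest threshold.
            }
            // Linear interpolation between the two surrounding thresholds.
            $ratio = ($count - $prevcount) / ($limit - $prevcount);
            return $prevvalue + $ratio * ($value - $prevvalue);
        }
        $prevcount = $limit;
        $prevvalue = $value;
    }
    return (float) $prevvalue; // Above the highest threshold.
}

/**
 * Comparative mapping: 0 is the mean of all users in the context and
 * +/- $maxdeviations standard deviations map to +/- 1.
 */
function indicator_from_zscore(float $count, array $allcounts, float $maxdeviations = 2.0): float {
    $n = count($allcounts);
    if ($n === 0) {
        return 0.0;
    }
    $mean = array_sum($allcounts) / $n;
    $variance = 0.0;
    foreach ($allcounts as $c) {
        $variance += ($c - $mean) ** 2;
    }
    $stddev = sqrt($variance / $n);
    if ($stddev == 0.0) {
        return 0.0; // Everybody has the same count.
    }
    $z = ($count - $mean) / $stddev;
    // Clamp so outliers never leave the [-1, 1] range.
    return max(-1.0, min(1.0, $z / $maxdeviations));
}
```

With the thresholds from the text, ''indicator_from_thresholds(3, [0 => -1.0, 3 => -0.33, 6 => 0.33, 10 => 1.0])'' yields approximately -0.33, and any count of 10 or more yields 1.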

=== Time splitting methods ===

A time splitting method is what defines when the system will calculate predictions and the portion of activity logs that will be considered for those predictions. They are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. This is relatively simple. Things get more complicated when we want to predict what will happen in future. For example, predictions about [https://docs.moodle.org/en/Students_at_risk_of_dropping_out Students at risk of dropping out] are not useful once the course is over or when it is too late for any intervention.

Calculations involving time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators within the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ten parts (tenths), ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (each one inclusive from the beginning of the course) or only from the start of the time range.

The time-splitting methods included in Moodle 3.4 assume that there is a fixed start and end date for each course, so the course can be divided into segments of equal length. This allows courses of different lengths to be included in the same prediction model, but makes these time-splitting methods useless for courses without fixed start or end dates, e.g. self-paced courses. These courses might instead use fixed time lengths such as weeks to define the boundaries of prediction calculations.
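To make the idea of equal-length and accumulative ranges concrete, here is a minimal sketch. It is not the code of the core time splitting classes: the function name, its parameters and the returned array shape are assumptions for this example.

```php
<?php
/**
 * Hypothetical helper: split the period between $start and $end (timestamps)
 * into $parts ranges of equal length.
 *
 * When $accumulative is true every range starts at the course start, so each
 * range includes all the activity of the previous ones.
 *
 * @return array[] List of ['start' => int, 'end' => int] ranges.
 */
function split_into_ranges(int $start, int $end, int $parts, bool $accumulative = false): array {
    if ($parts < 1 || $end <= $start) {
        throw new InvalidArgumentException('A positive number of parts and a start before the end are required.');
    }
    $length = ($end - $start) / $parts;
    $ranges = [];
    for ($i = 0; $i < $parts; $i++) {
        $rangestart = $accumulative ? $start : (int) floor($start + $i * $length);
        // Pin the last boundary to the course end to avoid rounding drift.
        $rangeend = ($i === $parts - 1) ? $end : (int) floor($start + ($i + 1) * $length);
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

Calling ''split_into_ranges($start, $end, 4)'' gives quarters, while passing ''true'' as the last argument gives accumulative quarters. A course without a fixed end date cannot be split this way, which is the limitation described above.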

=== Machine learning backends ===

They process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.
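Reading an input file in batches, as mentioned above, might look like the following sketch. It assumes a plain CSV dataset with one sample per row; the function name and the batch size parameter are hypothetical and do not describe how the core backends read their files.

```php
<?php
/**
 * Hypothetical helper: yield the rows of a CSV dataset in batches so that a
 * backend never needs to hold the whole file in memory.
 *
 * @param string $path Path to the dataset file.
 * @param int $batchsize Maximum number of rows per batch.
 * @return Generator Yields arrays of rows.
 */
function read_dataset_in_batches(string $path, int $batchsize): Generator {
    if ($batchsize < 1) {
        throw new InvalidArgumentException('The batch size must be at least 1.');
    }
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open dataset file: $path");
    }
    try {
        $batch = [];
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // Skip blank lines.
            }
            $batch[] = $row;
            if (count($batch) === $batchsize) {
                yield $batch;
                $batch = [];
            }
        }
        if ($batch) {
            yield $batch; // Last, possibly smaller, batch.
        }
    } finally {
        fclose($handle);
    }
}
```

A backend could then train incrementally with ''foreach (read_dataset_in_batches($path, 1000) as $batch) { ... }'', provided its algorithm supports partial fitting.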

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of drop out detection', adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''
* '''\core_analytics\analysable'''

=== Interfaces ===

==== Predictor ====

This is the basic interface to be implemented by machine learning backends. The two main types are classifiers and regressors. We provide the ''Regressor'' interface, but it is not currently implemented by the core machine learning backends. Both types are supervised algorithms. Each type includes methods to train, predict and evaluate datasets.

===== Classifier =====

A [https://en.wikipedia.org/wiki/Statistical_classification classifier] sorts input into two or more categories, based on analysis of the indicators. This is frequently used in binary predictions, e.g. course completion vs. dropout. This machine learning algorithm is "supervised": It requires a training data set of elements whose classification is known (e.g. courses in the past with a clear definition of whether the student has dropped out or not). This is an interface to be implemented by machine learning backends that support classification. It extends the ''Predictor'' interface.

===== Regressor =====

A [https://en.wikipedia.org/wiki/Regression_analysis regressor] predicts the value of an outcome (or dependent) variable based on analysis of the indicators. This value is linear, such as a final grade in a course or the likelihood that a student will pass a course. This machine learning algorithm is "supervised": it requires a training data set of elements whose outcome value is known (e.g. past courses with recorded final grades). This is an interface to be implemented by machine learning backends that support regression. It extends the ''Predictor'' interface.
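The relationship between the three interfaces can be pictured with the simplified sketch below. These are deliberately not the real Moodle interfaces, which exchange dataset files and have different signatures: every name prefixed with ''sketch_'', every method signature and the toy ''majority_class_classifier'' are assumptions for illustration.

```php
<?php
// Simplified, hypothetical interfaces. They work on in-memory arrays only to
// keep the example self-contained.
interface sketch_predictor {
    /** Whether the backend can be used (e.g. its dependencies are installed). */
    public function is_ready(): bool;
}

interface sketch_classifier extends sketch_predictor {
    /** @param array[] $samples Indicator values; @param int[] $labels Known classes. */
    public function train_classification(array $samples, array $labels): void;
    /** @return int[] Predicted class for each sample, keyed like $samples. */
    public function classify(array $samples): array;
}

interface sketch_regressor extends sketch_predictor {
    /** @param array[] $samples Indicator values; @param float[] $values Known outcomes. */
    public function train_regression(array $samples, array $values): void;
    /** @return float[] Estimated value for each sample, keyed like $samples. */
    public function estimate(array $samples): array;
}

/** Toy baseline: always predicts the most frequent class seen during training. */
final class majority_class_classifier implements sketch_classifier {
    /** @var int|null Most frequent class, null until trained. */
    private $majority = null;

    public function is_ready(): bool {
        return true;
    }

    public function train_classification(array $samples, array $labels): void {
        if (!$labels) {
            throw new InvalidArgumentException('Training requires labelled samples.');
        }
        $counts = array_count_values($labels);
        arsort($counts);
        $this->majority = array_key_first($counts);
    }

    public function classify(array $samples): array {
        if ($this->majority === null) {
            throw new LogicException('The classifier has not been trained yet.');
        }
        return array_fill_keys(array_keys($samples), $this->majority);
    }
}
```

A backend that only supports classification would implement the classifier interface alone, which mirrors the situation of the core backends described above.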

==== Analysable ====

Analysable items are those elements in Moodle that contain samples. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read the related comments above in [[#Analyser]].

==== Calculable ====

Both indicators and targets must implement this interface. It defines the data element to be used in calculations, whether as independent (indicator) or dependent (target) variables.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

Start by defining what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above. The API can be used for all kinds of models, though if you want to predict something like "student success", this definition should probably have some basis in pedagogy. (For example, the included model [https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] is based on the Community of Inquiry theoretical framework, and attempts to predict whether students will complete a course based on indicators designed to represent the three components of the CoI framework: teaching presence, social presence, and cognitive presence.) Be clear about how the target will be defined, because the model must be trained using known examples. This means that if, for example, you want to predict the final grade of a course per student, the courses being used to train the model must include accurate final grades.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control analysable validity through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems (imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed).

This decision is important because:
* Time splitting methods are applied to the analysable start and end times (e.g. '''Quarters''' can split the duration of a course into 4 parts).
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (though this is only a default behaviour that you can override in your target).

Note that the existing time splitting methods are proportional to the length of the course, e.g. quarters, tenths, etc. This allows courses with different lengths to be included in the same model, but requires courses to have defined start and end dates. Other time splitting methods are possible which do not depend on the defined length of the course, e.g. weekly. These would be more appropriate for self-paced courses without fixed start and end dates.

You do not need to require a single time splitting method at this stage, and they can be changed whenever the model is trained. You do need to define whether the model will make a single prediction or multiple predictions per analysable.

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classifications, so unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''. Technically, targets could be reused between models, although this is not recommended; focus instead on having a single model with a single set of indicators that work together towards accurate predictions. The only valid use case I can think of for models in production is using different time-splitting methods, although, again, the proper way to solve this is by using a single time-splitting method specific to your needs.

=== Create the model ===

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target: classify users as spammers
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators: two different indicators that predict that the user is a spammer
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

[https://docs.moodle.org/34/en/Students_at_risk_of_dropping_out Students at risk of dropping out] (based on student's activity, included in [https://docs.moodle.org/34/en/Analytics Moodle 3.4])
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model
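The sample validity rules of the spam example above could be expressed roughly as follows. This standalone function only illustrates the timing logic; it is not a real target method, and its name, parameters and constants are assumptions.

```php
<?php
// Local stand-ins for time constants so the sketch is self-contained.
const SKETCH_HOURSECS = 3600;
const SKETCH_DAYSECS = 86400;

/**
 * Hypothetical validity check for a "spam users" sample.
 *
 * @param int $timecreated Timestamp of the user account creation.
 * @param int $now Current timestamp.
 * @param bool $fortraining True when the sample is evaluated for training.
 */
function is_valid_spam_sample(int $timecreated, int $now, bool $fortraining): bool {
    $elapsed = $now - $timecreated;
    if ($fortraining) {
        // Training needs the full 2 days so the spammer flag can be trusted.
        return $elapsed >= 2 * SKETCH_DAYSECS;
    }
    // Predictions can start once the first 4 hours of activity are available.
    return $elapsed >= 4 * SKETCH_HOURSECS;
}
```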

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students' quiz attempts, the student's performance in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. This directory can be used by machine learning backends to store trained algorithms (their internal variables, weights and so on) so they can be used later to get predictions. The Moodle cron lock will prevent multiple executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-11-03T16:25:18Z

Dmonllao: /* Extension points */

== Summary ==

The Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is good enough, Moodle internally trains a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting what is most likely to happen. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the prevention of student dropout: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows how the main components of the analytics API interact with each other.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data any Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relations. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core should only include models that have been proven to be good at predicting in a wide range of sites and courses.

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with are obviously better at predicting on these institutions' sites, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact Elizabeth Dalton at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

=== Training ===

Definition for people not familiar with machine learning concepts: training is a process we need to run before being able to predict anything. We record what has already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for. We use the Moodle site's data for this: a sample is a set of calculations we make using the site data. These samples are unrelated to testing data, PHPUnit and the like, and they are identified by an id. The id of a sample can be any Moodle entity id (a course, a user, an enrolment, a quiz attempt, etc.) and the calculations the sample contains depend on it. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on quiz attempts will help develop the potential insights that the analytics might offer about the quiz attempts of a particular group of students. Further info in Analytics_API#Analyser, as analyser classes define what a sample is.

=== Prediction model ===

As explained above, a prediction model is a combination of indicators and a target. Your system models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relation between indicators and targets is stored in the ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all model-related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction model evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing large quantities of data to make more or less accurate predictions. There are obvious things different stakeholders may be interested in knowing that we can easily calculate. These ''static model'' predictions are directly based on indicator calculations. They are based on the assumptions defined in the target, but they should still be based on indicators so that these indicators can be reused across different prediction models; therefore static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of static models:
* Courses without teaching activity
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

These cases are nothing new and we could already be generating notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible of sending the notifications and all related stuff.
* Analytics API tracks user actions after viewing the predictions, so we can know if insights derive in actions, which insights are not useful...
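
The idea of a static model can be illustrated with a minimal, self-contained sketch for the ''Students that have never logged into the system'' example. It does not use the real Analytics API: the function name, the input format (sample id => last access timestamp) and the rule itself are assumptions made only for illustration.

```php
<?php
// Hypothetical, simplified stand-in for a static model; none of these names
// exist in Moodle. Input: sample id (a user id) => last access timestamp,
// where 0 means the user has never logged in.
function predict_never_logged_in(array $samples): array {
    $predictions = [];
    foreach ($samples as $sampleid => $lastaccess) {
        // Indicator calculation: -1 = never accessed, 1 = accessed at least once.
        $indicator = ($lastaccess === 0) ? -1.0 : 1.0;
        // Assumption coded in the "target": flag the sample (1) when the
        // indicator is at its minimum. No training data is involved.
        $predictions[$sampleid] = ($indicator === -1.0) ? 1 : 0;
    }
    return $predictions;
}
```

The prediction is derived directly from the indicator value, so there is nothing to train or evaluate.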

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()'', which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids with related data; they make sense once they are combined with the target and the indicator classes.

Other responsibilities of analyser classes:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. We do it this way because, depending on the size of the site, it could take hours to analyse the whole site. This is a good way to break the process up into pieces. When coding a new analyser you need to decide whether to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses) or '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site).
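
The per-course approach described above can be sketched roughly as follows. This is not the code used by the core analysers: the function name, the callback and the comma-separated file layout are all assumptions chosen to keep the example self-contained.

```php
<?php
// Hypothetical sketch, not Moodle code. $analysecourse is an assumed callback
// that returns the dataset rows (arrays of scalar values) for one course.
function analyse_by_course(array $courseids, callable $analysecourse, string $outfile): void {
    $partfiles = [];
    foreach ($courseids as $courseid) {
        // Only the rows of the current course are held in memory.
        $partfile = tempnam(sys_get_temp_dir(), 'course');
        $fh = fopen($partfile, 'w');
        foreach ($analysecourse($courseid) as $row) {
            fwrite($fh, implode(',', $row) . "\n");
        }
        fclose($fh);
        $partfiles[] = $partfile;
    }
    // Merge the per-course files once every course has been analysed.
    $out = fopen($outfile, 'w');
    foreach ($partfiles as $partfile) {
        fwrite($out, file_get_contents($partfile));
        unlink($partfile);
    }
    fclose($out);
}
```

Each course is processed independently, so a long site-wide analysis is broken into small pieces and memory can be released after each course.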

=== Target ===

Targets are the key element that defines the model. Targets are coded as PHP classes; they define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.

Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. A few examples, in case the difference between analysers, samples and targets is not clear:
* '''Target''': 'students at risk of dropping out'. '''Analyser provides''': 'course student'
* '''Target''': 'spammer'. '''Analyser provides''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so initially only binary classification is fully supported. See [https://tracker.moodle.org/browse/MDL-59044 MDL-59044] and [https://tracker.moodle.org/browse/MDL-60523 MDL-60523] for more info.

Although there is no technical restriction on directly using core targets in your own models, it does not make much sense in most cases. One possible case would be to create two new models using the same target and different sets of indicators.

==== Insights ====

Another aspect controlled by targets is insight generation. Analyser samples always have a context; the context level (activity module, course, user...) depends on the sample. This context is used to notify users with the '''moodle/analytics:listinsights''' capability (the teacher role by default) that new insights are available. These users receive a notification with a link to the predictions page, where all predictions for that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''Students at risk of dropping out'', the actions can be things like sending a message to the student or viewing their course activity report.

=== Indicator ===

Indicators are also defined as PHP classes. Their responsibility is quite simple: to calculate the indicator using the provided sample. Moodle core includes a set of indicators that can be used in your models without any PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are, which makes indicators easier to reuse in different models. Indicators specify the minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and a ''user'' will be provided by that analyser. However, the indicator can easily be coded so that it can be reused with other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly: ''User posts in any forum'', which could be used in models like ''Inactive users'', or ''Posts in any of the course forums'', which could be used in models like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like ''absolute number of write actions'' because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.
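
As an illustration of limiting a calculation to this range, the following self-contained sketch maps a raw count of write actions onto the four values mentioned above. The function name and the thresholds (5 and 20 actions) are assumptions made for the example, not values used by any core indicator.

```php
<?php
// Hypothetical helper, not part of the Moodle API: maps an absolute number of
// write actions onto the [-1, 1] range an indicator is expected to return.
function write_actions_to_indicator_value(int $nactions): float {
    if ($nactions <= 0) {
        return -1.0;   // No actions at all.
    }
    if ($nactions < 5) {
        return -0.33;  // Some basic activity (threshold is an assumption).
    }
    if ($nactions < 20) {
        return 0.33;   // Activity (threshold is an assumption).
    }
    return 1.0;        // Plenty of activity.
}
```

Whatever the raw count is, the machine learning backend only ever receives a value between -1 and 1.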

=== Time splitting methods ===

A time splitting method defines when you will get predictions and which activity logs will be considered for those predictions. Time splitting methods are coded as PHP classes and Moodle core includes some that you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. That is fine; things get more complicated when we want to predict what will happen in the future. For example, predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.

Calculations in time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind, and we need to add time-dependent indicators to the calculated indicators so that machine learning algorithms can avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course in time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (from the beginning of the course) or only from the start of the time range.
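
The difference between accumulative and non-accumulative ranges can be shown with a small self-contained sketch. It is not one of the core time splitting classes; the function name and the returned array layout are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of the Moodle API: splits the period between
// two timestamps into $nparts ranges of equal length.
function split_time_ranges(int $start, int $end, int $nparts, bool $accumulative = false): array {
    if ($nparts < 1 || $end <= $start) {
        throw new InvalidArgumentException('Expected nparts >= 1 and end > start');
    }
    $length = intdiv($end - $start, $nparts);
    $ranges = [];
    for ($i = 0; $i < $nparts; $i++) {
        // The last range always finishes exactly at $end.
        $rangeend = ($i === $nparts - 1) ? $end : $start + ($i + 1) * $length;
        // Accumulative ranges always begin at the start of the course.
        $rangestart = $accumulative ? $start : $start + $i * $length;
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

Calling it with 4 parts produces quarters; with the accumulative flag set, every range starts at the beginning of the course and only the end moves forward.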

=== Machine learning backends ===

They process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, with core ('''lib/classes/analytics''') being the component that hosts general-purpose uses of the API.

=== Extension points ===

This API aims to be as extensible as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of dropping out' model, adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''
* '''\core_analytics\analysable'''

=== Interfaces ===

==== Predictor ====

Basic interface to be implemented by machine learning backends. It is of little use by itself; see the ''Classifier'' and ''Regressor'' interfaces below, which extend it.

===== Classifier =====

Interface to be implemented by machine learning backends that support classification; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

===== Regressor =====

Interface to be implemented by machine learning backends that support regression; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

==== Analysable ====

Analysable items are the elements analysed by analysers. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read the related comments above in [[#Analyser]].

==== Calculable ====

Indicators and targets must implement this interface.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

The best way to start is to define what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time, in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned up after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts)
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (not so important, just a default behaviour you can override in your target)

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification. So unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''.

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators.
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling it. You can enable models using the Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

Students at risk of dropping out (based on student's activity, included in Moodle 3.4)
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as the analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing and course activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because if after X attempts has not passed counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and so on) and use them later to get predictions. The Moodle cron lock prevents multiple concurrent executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-11-03T16:23:59Z

Dmonllao: /* Create the target */

== Summary ==

The Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is good enough, Moodle internally trains a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting what is most likely to happen. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the prevention of student drop-out: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows how the main components of the analytics API interact with each other.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data any Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relations. It groups the different parts of the framework that can be extended by 3rd parties to create their own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core should only include models that have been proven to be good at predicting in a wide range of sites and courses.

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised datasets from the Moodle sites of collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with are obviously better at predicting on these institutions' sites, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact Elizabeth Dalton at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

=== Training ===

Definition for people not familiar with machine learning concepts: training is a process we need to run before being able to predict anything. We record what has already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for. We use the Moodle site's data for this: a sample is a set of calculations we make using the site data. These samples are unrelated to testing data, phpunit and the like, and they are identified by an id. The id of a sample can be any Moodle entity id: a course, a user, an enrolment, a quiz attempt, etc., and the calculations the sample contains depend on it. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on quiz attempts help develop the insights that the analytics might offer about the quiz attempts of a particular group of students. See Analytics_API#Analyser for further information, as analyser classes define what a sample is.

=== Prediction model ===

As explained above a prediction model is a combination of indicators and a target. Your system models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relation between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all model-related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages the evaluation logs of prediction models.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm processing large amounts of data behind them to make reasonably accurate predictions. There are obvious things that different stakeholders may be interested in knowing and that we can easily calculate. The predictions of these '''static models''' are based directly on indicator calculations. They follow the assumptions defined in the target, but they should still be based on indicators so that those indicators can be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of static models:
* Courses without teaching activity
* Courses with student submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

These cases are nothing new and we could already be generating notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a developer's point of view, as the analytics subsystem provides APIs for everything
* New indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and everything related to them
* The Analytics API tracks user actions after viewing the predictions, so we can know whether insights lead to actions, which insights are not useful...

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()'', which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids with related data; they make sense once they are combined with the target and the indicator classes.

Other responsibilities of analyser classes:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. We do it this way because, depending on the size of the site, it could take hours to analyse the whole site. This is a good way to break the process up into pieces. When coding a new analyser you need to decide whether to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses) or '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site).

=== Target ===

Targets are the key element that defines the model. Targets are coded as PHP classes; they define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.

Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. A few examples, in case the difference between analysers, samples and targets is not clear:
* '''Target''': 'students at risk of dropping out'. '''Analyser provides''': 'course student'
* '''Target''': 'spammer'. '''Analyser provides''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so initially only binary classification is fully supported. See [https://tracker.moodle.org/browse/MDL-59044 MDL-59044] and [https://tracker.moodle.org/browse/MDL-60523 MDL-60523] for more info.

Although there is no technical restriction on directly using core targets in your own models, it does not make much sense in most cases. One possible case would be to create two new models using the same target and different sets of indicators.

==== Insights ====

Another aspect controlled by targets is insight generation. Analyser samples always have a context; the context level (activity module, course, user...) depends on the sample. This context is used to notify users with the '''moodle/analytics:listinsights''' capability (the teacher role by default) that new insights are available. These users receive a notification with a link to the predictions page, where all predictions for that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''Students at risk of dropping out'', the actions can be things like sending a message to the student or viewing their course activity report.

=== Indicator ===

Indicators are also defined as PHP classes. Their responsibility is quite simple: to calculate the indicator using the provided sample. Moodle core includes a set of indicators that can be used in your models without any PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are, which makes indicators easier to reuse in different models. Indicators specify the minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and a ''user'' will be provided by that analyser. However, the indicator can easily be coded so that it can be reused with other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly: ''User posts in any forum'', which could be used in models like ''Inactive users'', or ''Posts in any of the course forums'', which could be used in models like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like ''absolute number of write actions'' because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.

=== Time splitting methods ===

A time splitting method defines when you will get predictions and which activity logs will be considered for those predictions. Time splitting methods are coded as PHP classes, and Moodle core includes some that you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. Things get more complicated when we want to predict what will happen in the future. For example, predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.

Calculations in time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind, and we need to include time-dependent indicators among the calculated indicators so that machine learning algorithms can avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course in time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (from the beginning of the course) or only from the start of the time range.
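As a rough sketch of the difference between accumulative and non-accumulative ranges, the function below splits a period into equal parts. The function name and the array layout are assumptions for illustration only; the real time splitting classes in Moodle core are not reproduced here.

```php
<?php
// Hypothetical helper: splits [$start, $end] (timestamps) into $nparts ranges.
// With $accumulative = true every range starts at the beginning of the course.
function split_in_ranges(int $start, int $end, int $nparts, bool $accumulative): array {
    $length = intdiv($end - $start, $nparts);
    $ranges = [];
    for ($i = 0; $i < $nparts; $i++) {
        $rangestart = $accumulative ? $start : $start + $i * $length;
        // The last range absorbs any rounding remainder.
        $rangeend = ($i === $nparts - 1) ? $end : $start + ($i + 1) * $length;
        $ranges[] = ['start' => $rangestart, 'end' => $rangeend];
    }
    return $ranges;
}
```

Calling it with 4 parts gives quarters; with the accumulative flag set, the second range covers the first half of the course instead of only the second quarter.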

=== Machine learning backends ===

They process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.
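Reading an input file in batches could look like the sketch below. It assumes the dataset is a CSV file with one sample per line; the function name and the batch size are hypothetical and not taken from any existing backend.

```php
<?php
// Hypothetical helper: yields the rows of a CSV dataset in batches so that
// the whole file never needs to be loaded in memory at once.
function read_in_batches(string $path, int $batchsize): \Generator {
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open dataset file: $path");
    }
    try {
        $batch = [];
        while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
            if ($row === [null]) {
                continue; // Skip blank lines.
            }
            $batch[] = $row;
            if (count($batch) === $batchsize) {
                yield $batch;
                $batch = [];
            }
        }
        if ($batch) {
            yield $batch; // Last, possibly smaller, batch.
        }
    } finally {
        fclose($handle);
    }
}
```

A backend could then loop over ''read_in_batches($file, 1000)'' and feed each batch to an algorithm that supports incremental training.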

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') hosts the general purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from a university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of drop out detection', adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''

=== Interfaces ===

==== Predictor ====

Basic interface to be implemented by machine learning backends. It is not useful by itself; see the interfaces that extend it below.

===== Classifier =====

Interface to be implemented by machine learning backends that support classification; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

===== Regressor =====

Interface to be implemented by machine learning backends that support regression; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

==== Analysable ====

Analysable items are the elements that analysers analyse. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read the related comments above in [[#Analyser]].

==== Calculable ====

Indicators and targets must implement this interface.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

The best way to start is to define what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time, in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts)
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (not so important, just a default behaviour you can override in your target)

==== Create the target ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', Moodle 3.4 machine learning backends only support binary classification. So unless you are using your own machine learning backend, you need to extend '''\core_analytics\local\target\binary'''.

You can create the model by specifying at least its target and, optionally, a set of indicators and a time splitting method:

// Instantiate the target
$target = \core_analytics\manager::get_target('\mod_yours\analytics\target\spammer_users');

// Instantiate indicators.
$indicator1 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_straight_after_new_account_created');
$indicator2 = \core_analytics\manager::get_indicator('\mod_yours\analytics\indicator\posts_contain_important_viagra');
$indicators = array($indicator1->get_id() => $indicator1, $indicator2->get_id() => $indicator2);

// Create the model.
$model = \core_analytics\model::create($target, $indicators, '\core\analytics\time_splitting\single_range');

Models are disabled by default because you may be interested in evaluating how good the model is at predicting before enabling them. You can enable models using Moodle UI or the analytics API:

$model->enable();

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

Students at risk of dropping out (based on student's activity, included in Moodle 3.4)
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, such as weights) so they can be used later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-10-27T10:46:35Z

Dmonllao: /* API usage examples */

== Summary ==

The Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is good enough, Moodle internally trains a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting what is most likely to happen. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the prevention of student drop out: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows how the main components of the analytics API interact with each other.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data any Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relations. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even same site courses can vary significantly. Moodle core should only include models that have been proven to be good at predicting in a wide range of sites and courses.

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised datasets from the Moodle sites of collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with are obviously better at predicting on these institutions' sites, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact Elizabeth Dalton at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

=== Training ===

Definition for people not familiar with machine learning concepts: training is a process we need to run before being able to predict anything. We record what already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for. We use the Moodle site's data for this: a sample is a set of calculations we make using the site data. These samples are unrelated to testing data or PHPUnit, and they are identified by an id. The id of a sample can be any Moodle entity id (a course, a user, an enrolment, a quiz attempt, etc.) and the calculations the sample contains depend on it. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on quiz attempts help develop insights related to the quiz attempts of a particular group of students. Further info in Analytics_API#Analyser, as analyser classes define what a sample is.

=== Prediction model ===

As explained above a prediction model is a combination of indicators and a target. Your system models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relation between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all models related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing tons of data to make more or less accurate predictions. There are obvious things different stakeholders may be interested in knowing that we can easily calculate. The predictions of these '''static models''' are directly based on indicator calculations. They are based on the assumptions defined in the target, but they should still be based on indicators so that all these indicators can be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of static models:
* Courses without teaching activity
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

These cases are nothing new and we could already be generating notifications for the examples above, but there are some benefits to doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as the analytics subsystem provides APIs for everything
* New indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and all related tasks.
* The analytics API tracks user actions after viewing the predictions, so we can know if insights lead to actions, which insights are not useful...
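To illustrate how a static model derives its prediction directly from indicator values, without any training, here is a minimal sketch for the ''Courses without teaching activity'' example. The function name and the ''any_teacher'' indicator key are hypothetical; the sketch only assumes that indicator values go from -1 to 1.

```php
<?php
// Hypothetical static rule: a course is flagged when its (assumed)
// "any_teacher" indicator is at the minimum value, -1.
function courses_without_teaching(array $samples): array {
    $flagged = [];
    foreach ($samples as $sampleid => $indicators) {
        if ($indicators['any_teacher'] <= -1.0) {
            $flagged[] = $sampleid;
        }
    }
    return $flagged;
}

// Sample id => calculated indicator values (made-up data for the example).
$samples = [
    10 => ['any_teacher' => 1.0],
    11 => ['any_teacher' => -1.0],
    12 => ['any_teacher' => -1.0],
];
$flagged = courses_without_teaching($samples);
```

The rule itself is the assumption defined by the target; the indicator calculation stays reusable by other models.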

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()'', which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids with related data; they make sense once they are combined with the target and the indicator classes.

Other responsibilities of analyser classes:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser there is an important non-obvious fact you should know about: for scalability reasons all calculations at course level are executed on a per-course basis, and the resulting datasets are merged together once the analysis of all site courses is complete. We do it this way because, depending on the site's size, it could take hours to complete the analysis of the whole site. This is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses) or '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site).
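The per-course approach can be pictured with the sketch below: each course is analysed on its own, its rows are written to a partial file, and the partial files are merged at the end. This is not the code used by the analyser base classes; the function name, the callback and the CSV format are assumptions for illustration.

```php
<?php
// Hypothetical sketch: analyse courses one by one and merge the partial
// datasets once all of them are done. $analysecourse is an assumed callback
// that returns the dataset rows for a single course.
function analyse_site_by_course(array $courseids, callable $analysecourse, string $mergedpath): int {
    $parts = [];
    foreach ($courseids as $courseid) {
        // Only the rows of the current course are held in memory.
        $part = tempnam(sys_get_temp_dir(), 'course');
        $handle = fopen($part, 'w');
        foreach ($analysecourse($courseid) as $row) {
            fputcsv($handle, $row, ',', '"', '\\');
        }
        fclose($handle);
        $parts[] = $part;
    }

    // Merge step: stream every partial file into the final dataset.
    $merged = fopen($mergedpath, 'w');
    $nrows = 0;
    foreach ($parts as $part) {
        $handle = fopen($part, 'r');
        while (($line = fgets($handle)) !== false) {
            fwrite($merged, $line);
            $nrows++;
        }
        fclose($handle);
        unlink($part);
    }
    fclose($merged);
    return $nrows;
}
```

Because each course is processed independently, a long site analysis can be split into pieces without keeping every sample in memory.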

=== Target ===

Targets are the key element that defines the model. They are coded as PHP classes; they define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.

Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. A few examples in case the difference between analysers, samples and targets is not clear:
* '''Target''': 'students at risk of dropping out'. '''Analyser provides''': 'course student'
* '''Target''': 'spammer'. '''Analyser provides''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so initially only binary classification is fully supported. See [https://tracker.moodle.org/browse/MDL-59044 MDL-59044] and [https://tracker.moodle.org/browse/MDL-60523 MDL-60523] for more info.

Although there is no technical restriction on directly using core targets in your own models, it does not make much sense in most cases. One possible case would be to create two new models using the same target and different sets of indicators.

==== Insights ====

Targets also control how insights are generated. Every analyser sample has a context; its level (activity module, course, user...) depends on the sample. This context is used to notify users with the '''moodle/analytics:listinsights''' capability (teacher role by default) that new insights are available. These users receive a notification with a link to the predictions page, where all predictions for that context are listed.

A set of suggested actions is available for each prediction. For a model like ''Students at risk of dropping out'', the actions can include sending a message to the student or viewing their course activity report.

=== Indicator ===

Indicators are also defined as PHP classes. Their responsibility is simple: to calculate the indicator using the provided sample. Moodle core includes a set of indicators that can be used in your models without any PHP coding (unless you want to extend their functionality).

Unlike targets, indicators are not limited to a single analyser, which makes them easier to reuse in different models. Indicators specify the minimum set of data they need to perform the calculation. The indicator developer should also consider how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target. This target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and a ''user'' will be provided by that analyser. However, the indicator can easily be coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly: ''User posts in any forum'', which could be used in models like ''Inactive users'', or ''Posts in any of the course forums'', which could be used in models like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like ''absolute number of write actions'' because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.

=== Time splitting methods ===

A time splitting method defines when you will get predictions and which activity logs will be considered for those predictions. Time splitting methods are coded as PHP classes, and Moodle core includes some that you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. Things get more complicated when we want to predict what will happen in the future. For example, predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.

Calculations in time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind, and we need to include time-dependent indicators among the calculated indicators so that machine learning algorithms can avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course in time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (from the beginning of the course) or only from the start of the time range.

=== Machine learning backends ===

They process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') hosts the general purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from a university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of drop out detection', adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''

=== Interfaces ===

==== Predictor ====

Basic interface to be implemented by machine learning backends. It is not useful by itself; see the interfaces that extend it below.

===== Classifier =====

Interface to be implemented by machine learning backends that support classification; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

===== Regressor =====

Interface to be implemented by machine learning backends that support regression; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

==== Analysable ====

Analysable items are the elements that analysers analyse. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables: '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read the related comments above in [[#Analyser]].

==== Calculable ====

Indicators and targets must implement this interface.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

The best way to start is to define what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time, in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts)
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (not so important, just a default behaviour you can override in your target)

==== Create the target ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification. So, unless you are using your own machine learning backend, you need to extend '''\core_analytics\local\target\binary'''.

''To be completed...''

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

Students at risk of dropping out (based on student's activity, included in Moodle 3.4)
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Late assignment submissions (based on student's activity)
* '''Time splitting:''' close to the analysable end date (1 week before, X days before...)
* '''Analyser samples:''' assignment submissions (analysable elements are activities)
* '''target::is_valid_analysable'''
** For prediction = the assignment is open for submissions
** For training = past assignment due date
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on previous students activity

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. This directory can be used by machine learning backends to store trained algorithms (their internal variables, weights and the like) to use them later to get predictions. The Moodle cron lock will prevent multiple executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-10-27T10:41:55Z

Dmonllao: /* Indicators */

== Summary ==

The Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is good enough, Moodle internally trains a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting what is most likely to happen. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the prevention of student dropout: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows how the main components of the analytics API interact with each other.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data any Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relations. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core should only include models that have been proven to be good at predicting in a wide range of sites and courses.

To diversify the samples and to cover a wider range of cases, the Moodle HQ research team is collecting anonymised datasets of Moodle sites from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with are obviously better at predicting on these institutions' sites, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact Elizabeth Dalton at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

=== Training ===

Definition for people not familiar with machine learning concepts: training is a process we need to run before being able to predict anything. We record what already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for. We use the Moodle site's data for this: a sample is a set of calculations we make using the site data. These samples are unrelated to testing data, phpunit and the like, and they are identified by an id. The id of a sample can be any Moodle entity id (a course, a user, an enrolment, a quiz attempt, etc.) and the calculations the sample contains depend on it. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on quiz attempts will help develop the insights related to the quiz attempts of a particular group of students. See [[#Analyser]] for further information, as analyser classes define what a sample is.

=== Prediction model ===

As explained above a prediction model is a combination of indicators and a target. Your system models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relation between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all model-related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages the evaluation logs of prediction models.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing tons of data to make more or less accurate predictions. There are obvious things different stakeholders may be interested in knowing that we can easily calculate. The predictions of these '''static models''' are directly based on indicator calculations. They are based on the assumptions defined in the target, but they should still be based on indicators so that these indicators can be reused across different prediction models. For this reason, static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of static models:
* Courses without teaching activity
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

These cases are nothing new and we could already be generating notifications for the examples above, but there are some benefits of doing it using the Moodle analytics API:
* Everything is easier and faster to code from a developer's point of view, as the analytics subsystem provides APIs for everything
* New indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and everything related to them
* The Analytics API tracks user actions after viewing the predictions, so we can know whether insights lead to actions, which insights are not useful...
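
To make the idea of a static model more concrete, the following is a minimal, language-agnostic sketch written in Python. It is not Moodle's PHP API: the function name, the ''has_teacher'' indicator key and the data layout are all hypothetical. It only illustrates that a static prediction is a fixed rule applied to already calculated indicator values, with no training step involved.

```python
def static_predictions(samples):
    """Return {sample_id: prediction} using a fixed rule, without any training.

    `samples` maps a sample id (here, a course id) to its calculated
    indicator values. The 'has_teacher' key is a hypothetical indicator
    whose minimum value (-1) means that no teacher is assigned.
    """
    predictions = {}
    for sample_id, indicators in samples.items():
        # The assumption defined by the target: no teacher => flag the course.
        predictions[sample_id] = 1 if indicators["has_teacher"] == -1 else 0
    return predictions


if __name__ == "__main__":
    # Two hypothetical courses: 101 has no teacher, 102 has one.
    print(static_predictions({101: {"has_teacher": -1}, 102: {"has_teacher": 1}}))
```

Because the rule still consumes indicator values, the same indicators remain reusable by models that do rely on a machine learning backend.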

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method though, ''get_all_samples()'', which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids with related data; they make sense once they are combined with the target and the indicator classes.

Other analyser classes responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser there is an important non-obvious fact you should know about: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. We do it this way because, depending on the size of the site, it could take hours to complete the analysis of the whole site. This is a good way to break the process up into pieces. When coding a new analyser you need to decide whether to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses) or '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site).
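
The per-course processing described above can be sketched as follows. This is only an illustration in Python under stated assumptions, not the Moodle implementation: the function names, the CSV layout and the shape of the ''course'' dictionaries are hypothetical. Each course is written to its own file so that only one course is held in memory at a time, and the files are merged at the end.

```python
import csv
import os
import tempfile


def analyse_course(course, output_dir):
    """Write the sample rows of one course to its own CSV file; return its path."""
    path = os.path.join(output_dir, "course_%d.csv" % course["id"])
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for sample_id, features in course["samples"].items():
            writer.writerow([sample_id] + list(features))
    return path


def merge_datasets(paths, merged_path):
    """Concatenate the per-course files, row by row, into one dataset file."""
    with open(merged_path, "w", newline="") as merged:
        writer = csv.writer(merged)
        for path in paths:
            with open(path, newline="") as handle:
                for row in csv.reader(handle):
                    writer.writerow(row)
    return merged_path


def analyse_site(courses, output_dir):
    """Analyse every course separately, then merge the resulting datasets."""
    paths = [analyse_course(course, output_dir) for course in courses]
    return merge_datasets(paths, os.path.join(output_dir, "site.csv"))


if __name__ == "__main__":
    demo_courses = [
        {"id": 1, "samples": {10: [0.5, -1.0]}},
        {"id": 2, "samples": {20: [1.0, 0.33]}},
    ]
    with tempfile.TemporaryDirectory() as directory:
        with open(analyse_site(demo_courses, directory)) as handle:
            print(handle.read())
```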

=== Target ===

Targets are the key element that defines the model. They are coded as PHP classes; they define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.

Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify which analyser it is using. A few examples, in case the difference between analysers, samples and targets is not clear:
* '''Target''': 'students at risk of dropping out'. '''Analyser provides''': 'course student'
* '''Target''': 'spammer'. '''Analyser provides''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classification will be initially fully supported. See [https://tracker.moodle.org/browse/MDL-59044 MDL-59044] and [https://tracker.moodle.org/browse/MDL-60523 MDL-60523] for more info.

Although there is no technical restriction on directly using core targets in your own models, it does not make much sense in most cases. One possible case would be to create 2 new models using the same target and different sets of indicators.

==== Insights ====

Another aspect controlled by targets is insight generation. Analyser samples always have a context; the context level (activity module, course, user...) depends on the sample. This context will be used to notify users with the '''moodle/analytics:listinsights''' capability (teacher role by default) about new insights being available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''Students at risk of dropping out'' the actions can be things like sending a message to the student, viewing their course activity report...

=== Indicator ===

Indicators are also defined as PHP classes. Their responsibility is quite simple: to calculate the indicator using the provided sample. Moodle core includes a set of indicators that can be used in your models without any PHP coding (unless you want to extend their functionality).

Indicators are not limited to one single analyser like targets are, which makes indicators easier to reuse in different models. Indicators specify the minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and a ''user'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly: ''User posts in any forum'', which could be used in models like ''Inactive users'', or ''Posts in any of the course forums'', which could be used in models like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like ''absolute number of write actions'' because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.
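
As a rough illustration of the range restriction, the sketch below (Python, not Moodle's PHP API) shows two ways an indicator could map a raw count into the [-1, 1] range: a linear rescaling with clamping, and the bucketed levels used in the example above. The function names and the thresholds 5 and 20 are hypothetical and chosen only for illustration.

```python
def scale_to_indicator_range(value, minimum, maximum):
    """Linearly map value from [minimum, maximum] onto [-1, 1], clamping outliers."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    clamped = max(minimum, min(maximum, value))
    return 2.0 * (clamped - minimum) / (maximum - minimum) - 1.0


def write_actions_indicator(write_actions):
    """Bucket a raw number of write actions into the four example levels."""
    if write_actions == 0:
        return -1.0   # no actions
    if write_actions < 5:     # hypothetical threshold for "some basic activity"
        return -0.33
    if write_actions < 20:    # hypothetical threshold for "activity"
        return 0.33
    return 1.0        # plenty of activity


if __name__ == "__main__":
    print(scale_to_indicator_range(5, 0, 10))
    print(write_actions_indicator(3))
```

Either approach keeps every indicator on the same scale, so no single indicator dominates the dataset simply because its raw values are large.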

=== Time splitting methods ===

A time splitting method defines when you will get predictions and the activity logs that will be considered for those predictions. Time splitting methods are coded as PHP classes and Moodle core includes some that you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. That is fine, but things get more complicated when we want to predict what will happen in the future. For example, predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.

Calculations in time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to include time-dependent indicators among the calculated indicators so that machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course in time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (from the beginning of the course) or only from the start of the time range.
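
The difference between accumulative and non-accumulative ranges can be sketched as follows. This is an illustrative Python function, not one of the time splitting classes shipped with Moodle; its name and signature are hypothetical, and timestamps are assumed to be plain integers.

```python
def split_time_ranges(start, end, parts, accumulative=False):
    """Split [start, end] into `parts` consecutive ranges of equal length.

    With accumulative=True every range begins at `start`, so each range
    also covers all the previous ones.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if end <= start:
        raise ValueError("end must be later than start")
    length = (end - start) / parts
    ranges = []
    for i in range(parts):
        range_start = start if accumulative else start + int(i * length)
        # Pin the last range to `end` so rounding never leaves a gap.
        range_end = end if i == parts - 1 else start + int((i + 1) * length)
        ranges.append((range_start, range_end))
    return ranges


if __name__ == "__main__":
    print(split_time_ranges(0, 400, 4))
    print(split_time_ranges(0, 400, 4, accumulative=True))
```

With four parts this corresponds to the "quarters" and "quarters accumulative" ideas mentioned in this page; with ten parts it would correspond to deciles.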

=== Machine learning backends ===

They process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.
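
Reading an input file in batches, as mentioned above, could look like the following minimal Python sketch. The dataset layout assumed here (one sample per line, comma-separated numeric values, no header) is a simplification for illustration and is not necessarily the format Moodle writes; the function name is hypothetical.

```python
import csv
import io


def read_in_batches(handle, batch_size):
    """Yield lists of at most `batch_size` numeric rows from an open CSV file."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for row in csv.reader(handle):
        batch.append([float(value) for value in row])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Last, possibly smaller, batch.
        yield batch


if __name__ == "__main__":
    dataset = io.StringIO("1,0\n2,1\n3,0\n")
    for rows in read_in_batches(dataset, 2):
        print(rows)
```

Because the function is a generator, only one batch is held in memory at a time, which is what allows a backend to train on files that are too big to load at once.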

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, with core ('''lib/classes/analytics''') being the component that hosts general purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system; the site admin could build a new model on top of the built-in 'students at risk of drop out detection', adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''

=== Interfaces ===

==== Predictor ====

Basic interface to be implemented by machine learning backends, pretty useless by itself (continue reading below)

===== Classifier =====

Interface to be implemented by machine learning backends that support classification; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

===== Regressor =====

Interface to be implemented by machine learning backends that support regression; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.
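
The relation between these interfaces can be sketched with abstract base classes. The Python below is only an illustration of the hierarchy: the method names (''is_ready'', ''train'', ''predict'', ''evaluate'') are hypothetical and are not the signatures of the PHP interfaces, and ''MajorityClassifier'' is a toy backend invented for this example. A ''Regressor'' would follow the same shape as ''Classifier'', returning numeric estimates instead of classes.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Predictor(ABC):
    """Base interface: only reports whether the backend can be used."""

    @abstractmethod
    def is_ready(self):
        """Return True when the backend is usable."""


class Classifier(Predictor):
    """Adds the train, predict and evaluate operations for classification."""

    @abstractmethod
    def train(self, rows, labels):
        """Learn from feature rows and their known labels."""

    @abstractmethod
    def predict(self, rows):
        """Return one predicted label per row."""

    @abstractmethod
    def evaluate(self, rows, labels):
        """Return the fraction of rows predicted correctly."""


class MajorityClassifier(Classifier):
    """Toy backend that always predicts the most frequent training label."""

    def __init__(self):
        self.majority = None

    def is_ready(self):
        return True

    def train(self, rows, labels):
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, rows):
        if self.majority is None:
            raise RuntimeError("train() must be called before predict()")
        return [self.majority for _ in rows]

    def evaluate(self, rows, labels):
        predictions = self.predict(rows)
        hits = sum(1 for p, label in zip(predictions, labels) if p == label)
        return hits / len(labels)


if __name__ == "__main__":
    backend = MajorityClassifier()
    backend.train([[0.1], [0.9], [0.8]], [0, 1, 1])
    print(backend.predict([[0.5], [0.2]]))
```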

==== Analysable ====

Analysable items are analysed by analysers :P In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables, '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read the related comments above in [[#Analyser]].

==== Calculable ====

Indicators and targets must implement this interface.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

The best way to start is to define what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time, in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time, you can reuse a sitewide analyser or define your own, and control sample validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts)
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (not so important, just a default behaviour you can override in your target)

==== Create the target ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the Moodle 3.4 machine learning backends only support binary classification. So, unless you are using your own machine learning backend, you need to extend '''\core_analytics\local\target\binary'''.

''To be completed...''

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to create your own indicators specific to the target you want to predict.

You can use '''Site administration > Analytics > Analytics models''' to see the list of available indicators and add some of them to your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

Students at risk of dropping out (based on student's activity, included in Moodle 3.4)
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data
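
The validity rules of this first example can be sketched as follows. This is a Python illustration rather than the PHP method of a real target: the dictionary keys (''start'', ''end'', ''has_activity'') and the function signature are hypothetical, and timestamps are assumed to be plain integers.

```python
def is_valid_analysable(course, now, for_training):
    """Decide whether a course can be used for training or for prediction.

    Training needs finished courses with activity; prediction needs
    courses that are still ongoing.
    """
    if for_training:
        return bool(course["end"] < now and course["has_activity"])
    return bool(course["start"] <= now < course["end"])


if __name__ == "__main__":
    course = {"start": 100, "end": 200, "has_activity": True}
    print(is_valid_analysable(course, 150, for_training=False))
    print(is_valid_analysable(course, 250, for_training=True))
```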

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. This directory can be used by machine learning backends to store trained algorithms (their internal variables, weights and the like) to use them later to get predictions. The Moodle cron lock will prevent multiple executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-10-27T10:27:22Z

Dmonllao:

== Summary ==

The Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is good enough, Moodle internally trains a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting what is most likely to happen. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the prevention of student dropout: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows how the main components of the analytics API interact with each other.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data any Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relations. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core should only include models that have been proven to be good at predicting in a wide range of sites and courses.

To diversify the samples and cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with are obviously better at predicting on these institutions' sites, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact Elizabeth Dalton at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

=== Training ===

Definition for people not familiar with machine learning concepts: training is a process we need to run before being able to predict anything. We record what has already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for. We use Moodle site data for this: a sample is a set of calculations we make using the site data. These samples are unrelated to test data or PHPUnit, and they are identified by an id. The id of a sample can be any Moodle entity id (a course, a user, an enrolment, a quiz attempt, etc.) and the calculations the sample contains depend on it. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on quiz attempts will help develop the potential insights related to the quiz attempts of a particular group of students. See [[#Analyser]] for further information, as analyser classes define what a sample is.

=== Prediction model ===

As explained above a prediction model is a combination of indicators and a target. Your system models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relation between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all models related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing tons of data to make more or less accurate predictions. There are obvious things different stakeholders may be interested in knowing that we can easily calculate. The predictions of these ''static models'' are directly based on indicator calculations. They are based on the assumptions defined in the target, but they should still be based on indicators so that these indicators can be reused across different prediction models. Static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of static models:
* Courses without teaching activity
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

These cases are nothing new and we could already be generating notifications for the examples above, but there are some benefits of doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and related tasks.
* The Analytics API tracks user actions after viewing the predictions, so we can know whether insights lead to actions, which insights are not useful...
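The idea of a static model can be illustrated with a minimal sketch. It is written in Python for brevity and is not Moodle's PHP API; the indicator names are hypothetical, and it assumes each indicator value is already in the -1 to 1 range with -1 meaning "none".

```python
def predict_no_teaching(indicator_values):
    """Static prediction: flag a course when every teaching indicator is at its minimum.

    indicator_values maps a (hypothetical) indicator name to a value in [-1, 1].
    No training is involved; the rule encodes the assumption defined by the target.
    """
    if not indicator_values:
        raise ValueError("at least one indicator is required")
    return 1 if all(value == -1 for value in indicator_values.values()) else 0


# A course with no teachers enrolled and no teacher access is flagged.
flagged = predict_no_teaching({"any_teacher_enrolled": -1, "any_teacher_access": -1})
```

Because the rule only consumes indicator values, the same indicators remain reusable in models that do rely on a machine learning backend.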

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()'', which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids with related data; they make sense once they are combined with the target and the indicator classes.

Other analyser classes responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. We do it this way because, depending on the size of the site, it could take hours to complete the analysis of the whole site. This is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses) or '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site).
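The per-course strategy can be pictured with the following sketch. It is an illustration in Python, not the actual analyser code: ''analyse_course'' is a hypothetical callback that returns the column names and the rows calculated for one course.

```python
def analyse_site_by_course(course_ids, analyse_course):
    """Analyse one course at a time and merge the per-course datasets."""
    header = None
    merged_rows = []
    for course_id in course_ids:
        # Only one course's intermediate calculations are alive at a time.
        course_header, rows = analyse_course(course_id)
        if header is None:
            header = course_header
        elif course_header != header:
            raise ValueError("all courses must produce the same columns")
        merged_rows.extend(rows)
    return header, merged_rows
```

A real implementation would more likely write each per-course result to a file and merge the files, so that the merged dataset never has to fit in memory.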

=== Target ===

Targets are the key element that defines the model. They are coded as PHP classes, define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.

Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify what analyser it is using. A few examples, in case the difference between analysers, samples and targets is not clear:
* '''Target''': 'students at risk of dropping out'. '''Analyser provides''': 'course student'
* '''Target''': 'spammer'. '''Analyser provides''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression but the machine learning backends included in core do not yet support multiclass classification nor regression so only binary classifications will be initially fully supported. See [https://tracker.moodle.org/browse/MDL-59044 MDL-59044] and [https://tracker.moodle.org/browse/MDL-60523 https://tracker.moodle.org/browse/MDL-60523] for more info.

Although there is no technical restriction on directly using core targets in your own models, it does not make much sense in most cases. One possible case would be to create two new models using the same target and different sets of indicators.

==== Insights ====

Another aspect controlled by targets is insight generation. Analyser samples always have a context; the context level (activity module, course, user...) depends on the sample. This context is used to notify users with the '''moodle/analytics:listinsights''' capability (teacher role by default) that new insights are available. These users will receive a notification with a link to the predictions page where all predictions of that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''Students at risk of dropping out'' the actions can be things such as sending a message to the student or viewing their course activity report.

=== Indicator ===

Indicators are also defined as PHP classes. Their responsibility is quite simple: to calculate the indicator using the provided sample. Moodle core includes a set of indicators that can be used in your models without any PHP coding (unless you want to extend their functionality).

Indicators are not limited to one single analyser like targets are, which makes indicators easier to reuse in different models. Indicators specify a minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and a ''user'' will be provided by that analyser, but this indicator can be easily coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly: ''User posts in any forum'', which could be used in models like ''Inactive users'', or ''Posts in any of the course forums'', which could be used in models like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like ''absolute number of write actions'' because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.
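As an illustration of limiting a calculation to that range, the sketch below maps an absolute count of write actions to one of the four values mentioned above. It is a Python sketch, not Moodle code, and the threshold values are arbitrary assumptions chosen for the example.

```python
def scale_to_indicator_range(raw_count, thresholds=(1, 5, 10)):
    """Map an absolute count to one of four values in the [-1, 1] range.

    thresholds are hypothetical cut-off points: below the first one there is
    no activity, at or above the last one there is plenty of activity.
    """
    low, mid, high = thresholds
    if raw_count < low:
        return -1.0   # 0 actions
    if raw_count < mid:
        return -0.33  # some basic activity
    if raw_count < high:
        return 0.33   # activity
    return 1.0        # plenty of activity
```

Whatever the thresholds are, the machine learning backend only ever sees values between -1 and 1, never the absolute number of actions.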

=== Time splitting methods ===

A time splitting method defines when you will get predictions and which activity logs will be considered for those predictions. Time splitting methods are coded as PHP classes and Moodle core includes some time splitting methods you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. Things get more complicated when we want to predict what will happen in the future. E.g. predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.

Calculating in time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind and we need to add time-dependent indicators to the calculated indicators so machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course in time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (from the beginning of the course) or only from the start of the time range.
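The difference between accumulative and non-accumulative ranges can be sketched as follows. This is an illustrative Python function, not one of the time splitting classes shipped with Moodle; it assumes integer timestamps and equally sized parts.

```python
def split_time_ranges(start, end, parts, accumulative=False):
    """Split [start, end] into `parts` equal ranges of (range_start, range_end).

    With accumulative=True every range starts at the beginning of the course,
    otherwise each range starts where the previous one ended.
    """
    if parts < 1 or end <= start:
        raise ValueError("parts must be >= 1 and end must be after start")
    length = (end - start) / parts
    ranges = []
    for i in range(parts):
        range_start = start if accumulative else start + round(i * length)
        range_end = start + round((i + 1) * length)
        ranges.append((range_start, range_end))
    return ranges


quarters = split_time_ranges(0, 400, 4)
quarters_accumulative = split_time_ranges(0, 400, 4, accumulative=True)
```

Here ''quarters'' covers 0-100, 100-200, 200-300 and 300-400, while ''quarters_accumulative'' covers 0-100, 0-200, 0-300 and 0-400.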

=== Machine learning backends ===

They process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.
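A backend that reads its input in batches could look roughly like the sketch below. It is written in Python and assumes a plain CSV file with a single header row, which is a simplification and not a description of the exact dataset format Moodle writes.

```python
import csv
from itertools import islice


def read_in_batches(path, batch_size):
    """Yield (header, rows) pairs with at most batch_size rows each."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return  # empty file, nothing to yield
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield header, batch
```

Each batch can be fed to an algorithm that supports incremental training, so the whole file never needs to be loaded into memory at once.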

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components, being core ('''lib/classes/analytics''') the component that hosts general purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the Universities' student information system; the site admin could build a new model on top of the built-in 'students at risk of drop out detection' adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as example:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''

=== Interfaces ===

==== Predictor ====

Basic interface to be implemented by machine learning backends. It is of little use by itself; backends are expected to implement one of the interfaces below.

===== Classifier =====

Interface to be implemented by machine learning backends that support classification; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

===== Regressor =====

Interface to be implemented by machine learning backends that support regression; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.
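The relation between these interfaces can be sketched as a small class hierarchy. The sketch is in Python with hypothetical method names; it is not the PHP interface definition, and it works on in-memory lists rather than on dataset files to keep the example short.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Predictor(ABC):
    """Base interface: only reports whether the backend can be used."""

    @abstractmethod
    def is_ready(self):
        """Return True when the backend's requirements are met."""


class Classifier(Predictor):
    """Adds train, predict and evaluate operations for classification."""

    @abstractmethod
    def train(self, samples, labels): ...

    @abstractmethod
    def predict(self, samples): ...

    @abstractmethod
    def evaluate(self, samples, labels): ...


class MajorityClassifier(Classifier):
    """Toy backend: always predicts the most common training label."""

    def __init__(self):
        self.majority = None

    def is_ready(self):
        return True

    def train(self, samples, labels):
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, samples):
        return [self.majority for _ in samples]

    def evaluate(self, samples, labels):
        # Accuracy: fraction of predictions that match the known labels.
        predictions = self.predict(samples)
        hits = sum(p == l for p, l in zip(predictions, labels))
        return hits / len(labels)
```

A regressor would follow the same pattern, extending the base interface with its own train, predict and evaluate operations for continuous values.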

==== Analysable ====

Analysable items are analysed by analysers. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables, '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a '''\context''' and ''get_start()'' and ''get_end()'' methods. Read related comments above in [[#Analyser]].

==== Calculable ====

Indicators and targets must implement this interface.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

To define what you want to predict (the target) and the subjects of these predictions (the samples) is the best way to start. You can find the descriptions of these concepts above.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's ''is_valid_sample()'' method.
* If you want multiple predictions at different points in time for each sample reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's ''is_valid_analysable()'' method.
* If you want predictions at activity level use a "by_course" analyser as otherwise you may have scalability problems (imagine storing in memory calculations for each grade_grades record in your site; processing elements by course helps as we clean memory after each course is processed)

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts)
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, insights are notified to users with the '''moodle/analytics:listinsights''' capability at analysable level (not so important, just a default behaviour you can override in your target)

=== Create the target ===

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Even if Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''' Moodle 3.4 machine learning backends only support binary classifications. So unless you are using your own machine learning backend you need to extend '''\core_analytics\local\target\binary'''.

''To be completed...''

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to add your own indicators.

You can use the "Analytics models" tool to see the list of indicators and edit your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

Students at risk of dropping out (based on student's activity, included in Moodle 3.4)
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data
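The training and prediction conditions listed for this model can be expressed as a single check. The sketch below is in Python and only mirrors the idea behind ''target::is_valid_analysable''; the parameter names are hypothetical and it is not the PHP method signature.

```python
def is_valid_analysable(course_start, course_end, now, has_activity, for_training):
    """Decide whether a course can be used for training or for prediction.

    Training needs finished courses with activity, because the outcome must
    already be known. Prediction needs ongoing courses.
    """
    if for_training:
        return course_end < now and has_activity
    return course_start <= now < course_end
```

With this rule a course is never used for training and prediction at the same time: it moves from the prediction set to the training set once it finishes.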

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than the X% of the course students attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (X because a student who has not passed after X attempts counts as failed) or the student passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and similar state) and reuse them later to get predictions. The Moodle cron lock prevents multiple simultaneous executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics API

2017-10-27T10:25:56Z

Dmonllao:

== Summary ==

The Analytics API allows Moodle site managers to define prediction models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction of the target. Moodle is able to evaluate these models and, if the prediction accuracy is good enough, Moodle internally trains a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle starts predicting what is most likely to happen. Targets are free to define what actions will be performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.

An obvious example of a model you may be interested in is the prevention of student drop-out: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle calculates these indicators and the target for each student in a finished course and predicts which students are at risk of dropping out in ongoing courses.

=== API components ===

This diagram shows how the main components of the analytics API interact with each other.

[[File:Inspire_API_components.png]]

=== Data flow ===

The diagram below shows the different stages data goes through, from the data any Moodle site contains to actionable insights.

[[File:Inspire_data_flow.png]]

=== API classes diagram ===

This is a summary of the API classes and their relations. It groups the different parts of the framework that can be extended by 3rd parties to create your own prediction models.

[[File:Analytics_API_classes_diagram_(summary).svg]]

(Click on the image to expand, it is an SVG)

== Built-in models ==

People use Moodle in very different ways and even courses on the same site can vary significantly. Moodle core should only include models that have been proven to be good at predicting in a wide range of sites and courses.

To diversify the samples and cover a wider range of cases, the Moodle HQ research team is collecting anonymised Moodle site datasets from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle is shipped with are obviously better at predicting on these institutions' sites, although some other datasets are used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately in any Moodle site. If you are interested in collaborating please contact Elizabeth Dalton at [mailto:elizabeth@moodle.com elizabeth@moodle.com] to get information about the process.

Even if the models we include in Moodle core are already trained by Moodle HQ, each site will continue training its machine learning algorithms with its own data, which should lead to better prediction accuracy over time.

== Concepts ==

=== Training ===

Definition for people not familiar with machine learning concepts: training is a process we need to run before being able to predict anything. We record what has already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.

=== Samples ===

The machine learning backends we use to make predictions need to know what sort of patterns to look for. We use Moodle site data for this: a sample is a set of calculations we make using the site data. These samples are unrelated to test data or PHPUnit, and they are identified by an id. The id of a sample can be any Moodle entity id (a course, a user, an enrolment, a quiz attempt, etc.) and the calculations the sample contains depend on it. Each type of Moodle entity used as a sample helps develop the predictions that involve that kind of entity. For example, samples based on quiz attempts will help develop the potential insights related to the quiz attempts of a particular group of students. See [[#Analyser]] for further information, as analyser classes define what a sample is.

=== Prediction model ===

As explained above a prediction model is a combination of indicators and a target. Your system models can be viewed in '''Site administration > Analytics > Analytics models'''.

The relation between indicators and targets is stored in ''analytics_models'' database table.

The class '''\core_analytics\model''' manages all models related actions. ''evaluate()'', ''train()'' and ''predict()'' forward the calculated indicators to the machine learning backends. '''\core_analytics\model''' delegates all heavy processing to analysers and machine learning backends. It also manages prediction models evaluation logs.

'''\core_analytics\model''' class is not expected to be extended.

==== Static models ====

Some prediction models do not need a powerful machine learning algorithm behind them processing tons of data to make more or less accurate predictions. There are obvious things different stakeholders may be interested in knowing that we can easily calculate. The predictions of these ''static models'' are directly based on indicator calculations. They are based on the assumptions defined in the target, but they should still be based on indicators so that these indicators can be reused across different prediction models. Static models are not editable through the '''Site administration > Analytics > Analytics models''' user interface.

Some examples of static models:
* Courses without teaching activity
* Courses with students submissions requiring attention and no teachers accessing the course
* Courses that started 1 month ago and never accessed by anyone
* Students that have never logged into the system
* ....

These cases are nothing new and we could already be generating notifications for the examples above, but there are some benefits of doing it using the Moodle analytics API:
* Everything is easier and faster to code from a dev point of view as analytics subsystem provides APIs for everything
* New Indicators will be part of the core indicators pool that researchers (and 3rd party developers in general) can reuse in their own models
* Existing core indicators can be reused as well (the same indicators used for insights that depend on machine learning backends)
* Notifications are displayed using the core insights system, which is also responsible for sending the notifications and related tasks.
* The Analytics API tracks user actions after viewing the predictions, so we can know whether insights lead to actions, which insights are not useful...

=== Analyser ===

Analysers are responsible for creating the dataset files that will be sent to the machine learning processors. They are coded as PHP classes. Moodle core includes some analysers you can use in your models.

The base class '''\core_analytics\local\analyser\base''' does most of the work. It contains a key abstract method, ''get_all_samples()'', which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids with related data; they make sense once they are combined with the target and the indicator classes.

Other analyser classes responsibilities:
* Define the context of the predictions
* Discard invalid data
* Filter out already trained samples
* Include the time factor (time range processors, explained below)
* Forward calculations to indicators and target classes
* Record all calculations in a file
* Record all analysed sample ids in the database

If you are introducing a new analyser, there is an important non-obvious fact you should know about: for scalability reasons all calculations at course level are executed on a per-course basis and the resulting datasets are merged together once the analysis of all site courses is complete. We do it this way because, depending on the size of the site, it could take hours to complete the analysis of the whole site. This is a good way to break the process up into pieces. When coding a new analyser you need to decide if you want to extend '''\core_analytics\local\analyser\by_course''' (your analyser will process a list of courses) or '''\core_analytics\local\analyser\sitewide''' (your analyser will receive just one analysable element, the site).

=== Target ===

Targets are the key element that defines the model. They are coded as PHP classes, define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.

Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because analysers can be reused across different targets. Each target needs to specify what analyser it is using. A few examples, in case the difference between analysers, samples and targets is not clear:
* '''Target''': 'students at risk of dropping out'. '''Analyser provides''': 'course student'
* '''Target''': 'spammer'. '''Analyser provides''': 'site users'
* '''Target''': 'ineffective course'. '''Analyser provides''': 'courses'
* '''Target''': 'difficulties to pass a specific quiz'. '''Analyser provides''': 'quiz attempts in a specific quiz'

A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.

The API supports binary classification, multiclass classification and regression, but the machine learning backends included in core do not yet support multiclass classification or regression, so only binary classification is fully supported initially. See [https://tracker.moodle.org/browse/MDL-59044 MDL-59044] and [https://tracker.moodle.org/browse/MDL-60523 MDL-60523] for more info.

Although there is no technical restriction on using core targets directly in your own models, it does not make much sense in most cases. One possible case would be to create two new models using the same target and different sets of indicators.

==== Insights ====

Another aspect controlled by targets is the generation of insights. Analyser samples always have a context; the context level (activity module, course, user...) depends on the sample. This context is used to notify users with the '''moodle/analytics:listinsights''' capability (teacher role by default) that new insights are available. These users will receive a notification with a link to the predictions page, where all predictions for that context are listed.

A set of suggested actions will be available for each prediction. In cases like ''Students at risk of dropping out'', the actions can be things like sending a message to the student or viewing their course activity report.

=== Indicator ===

Indicators are also defined as PHP classes. Their responsibility is quite simple: to calculate the indicator using the provided sample. Moodle core includes a set of indicators that can be used in your models without any PHP coding (unless you want to extend their functionality).

Indicators are not limited to a single analyser like targets are, which makes indicators easier to reuse in different models. Indicators specify the minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named ''Posts in any forum'' could be initially coded for a ''Shy students in a course'' target; this target would use the ''course enrolments'' analyser, so the indicator developer knows that a ''course'' and a ''user'' will be provided by that analyser. However, this indicator can easily be coded so that it can be reused by other analysers like ''courses'' or ''users''. In this case the developer can choose to require ''course'' '''or''' ''user'', and the name of the indicator would change accordingly: ''User posts in any forum'', which could be used in models like ''Inactive users'', or ''Posts in any of the course forums'', which could be used in models like ''Low participation courses''.

The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like ''absolute number of write actions'' because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.
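The mapping from an absolute count to the [-1, 1] range can be sketched like this. The Python function below is only an illustration of the idea, not part of the Moodle API; the function name and the threshold values are assumptions chosen for the example.

```python
from bisect import bisect_right
from typing import Sequence


def indicator_value(count: int, thresholds: Sequence[int]) -> float:
    """Map an absolute count to evenly spaced levels between -1 and 1.

    thresholds holds, in ascending order, the smallest count that reaches
    each level above the lowest one (hypothetical values, chosen per indicator).
    """
    if not thresholds:
        raise ValueError("at least one threshold is required")
    level = bisect_right(thresholds, count)  # 0 .. len(thresholds)
    return -1 + 2 * level / len(thresholds)


# Four levels: no actions, some basic activity, activity, plenty of activity.
write_action_thresholds = [1, 10, 30]
values = [indicator_value(n, write_action_thresholds) for n in (0, 5, 10, 50)]
```

With three thresholds the four possible results are -1, about -0.33, about 0.33 and 1, matching the example ranges above.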

=== Time splitting methods ===

A time splitting method defines when you will get predictions and which activity logs will be considered for those predictions. Time splitting methods are coded as PHP classes and Moodle core includes some that you can use in your models.

In some cases the time factor is not important and we just want to classify a sample. That is fine, but things get more complicated when we want to predict what will happen in the future. For example, predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.

Calculations in time ranges can be a challenging aspect of some prediction models. Indicators need to be designed with this in mind, and we need to add time-dependent indicators to the calculated indicators so that machine learning algorithms can avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.

There are many different ways to split up a course in time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... And the ranges can be accumulative (from the beginning of the course) or only from the start of the time range.
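As an illustration of the difference between accumulative and non-accumulative ranges, here is a minimal Python sketch that splits a course into equal parts. It is not the code of the core time splitting classes; the function name and parameters are hypothetical, and times are plain Unix timestamps.

```python
from typing import List, Tuple


def split_time_ranges(start: int, end: int, parts: int,
                      accumulative: bool = False) -> List[Tuple[int, int]]:
    """Split [start, end] into equal parts and return (range start, range end) pairs.

    Accumulative ranges always begin at the course start; non-accumulative
    ranges begin where the previous range ended.
    """
    if end <= start or parts < 1:
        raise ValueError("end must be after start and parts must be positive")
    length = (end - start) / parts
    ranges = []
    for i in range(parts):
        range_start = start if accumulative else start + round(i * length)
        range_end = end if i == parts - 1 else start + round((i + 1) * length)
        ranges.append((range_start, range_end))
    return ranges


quarters = split_time_ranges(0, 400, 4)
quarters_accumulative = split_time_ranges(0, 400, 4, accumulative=True)
```

Ranges with longer periods at the beginning and shorter ones at the end would need a different boundary calculation, but the accumulative flag would work the same way.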

=== Machine learning backends ===

Machine learning backends process the datasets generated from the calculated indicators and targets. They are a new plugin type with a common interface that covers the following operations:

* Evaluate a provided prediction model
* Train a machine learning algorithm with the existing site data
* Predict targets based on previously trained algorithms

The communication between prediction processors and Moodle is through files because the code that will process the dataset can be written in PHP, in Python, in other languages or even use cloud services. This needs to be scalable so they are expected to be able to manage big files and train algorithms reading input files in batches if necessary.
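Reading an input file in batches can be sketched as follows. This is a generic Python example, not the code of any core backend, and the file layout used here (one comma-separated sample per line, with no header rows) is an assumption made for the example.

```python
import csv
import io
from typing import Iterator, List, TextIO


def read_batches(handle: TextIO, batch_size: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most batch_size rows without loading the whole file."""
    batch: List[List[str]] = []
    for row in csv.reader(handle):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Last, possibly smaller, batch.
        yield batch


# In-memory stand-in for a dataset file: two indicator columns and a target.
sample_file = io.StringIO("0.5,-1,1\n0.1,0.3,0\n-0.2,1,1\n")
batches = list(read_batches(sample_file, batch_size=2))
```

A training loop would consume one batch at a time, so memory use depends on the batch size rather than on the size of the file.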

== Design ==

The system is designed as a Moodle subsystem and API. It lives in '''analytics/'''. All analytics base classes are located here.

Machine learning backends are a new Moodle plugin type. They are stored in '''lib/mlbackend'''.

Uses of the analytics API are located in different Moodle components; core ('''lib/classes/analytics''') is the component that hosts general purpose uses of the API.

=== Extension points ===

This API aims to be as extendable as possible. Any Moodle component, including third party plugins, is able to define indicators, targets, analysers and time splitting processors.

An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of dropping out' detection, adding the SIS indicators to improve the model accuracy or for research purposes.

Moodle components (core subsystems, core plugins and 3rd party plugins) will be able to add and/or redefine any of the entities involved in all the data modeling process.

Some of the base classes to extend or follow as examples:
* '''\core_analytics\local\analyser\base'''
* '''\core_analytics\local\time_splitting\base'''
* '''\core_analytics\local\indicator\base'''
* '''\core_analytics\local\target\base'''

=== Interfaces ===

==== Predictor ====

Basic interface to be implemented by machine learning backends. It is of little use by itself; see the interfaces that extend it below.

===== Classifier =====

Interface to be implemented by machine learning backends that support classification; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

===== Regressor =====

Interface to be implemented by machine learning backends that support regression; it includes methods to train, predict and evaluate datasets. It extends ''Predictor'' interface.

==== Analysable ====

Analysable items are the elements analysed by analysers. In most cases an analysable will be a course, although it can also be the site or any other Moodle element, e.g. an activity.

Moodle core includes two analysables, '''\core_analytics\course''' and '''\core_analytics\site'''. Analysables need to provide an id, a name, a \context and get_start() and get_end() methods. Read related comments above in [[#Analyser]].

==== Calculable ====

Indicators and targets must implement this interface.

It is already implemented by '''\core_analytics\local\indicator\base''' and '''\core_analytics\local\target\base''' but you can still code targets or indicators from the '''\core_analytics\calculable''' base if you need more control.

== How to create a model ==

=== Define the problem ===

The best way to start is to define what you want to predict (the target) and the subjects of these predictions (the samples). You can find the descriptions of these concepts above.

=== How many predictions for each sample? ===

The next decision should be how many predictions you want to get for each sample (e.g. just one prediction before the course starts or a prediction every week). A single prediction for each sample is simpler than multiple predictions at different points in time, in terms of how deep into the API you will need to go to code it.

These are not absolute statements, but in general:
* If you want a single prediction for each sample at a specific point in time you can reuse a sitewide analyser or define your own and control samples validity through your target's is_valid_sample() method.
* If you want multiple predictions at different points in time for each sample, reuse an analysable element or define your own (extending '''\core_analytics\analysable''') and reuse or define your own analyser to retrieve these analysable elements. You can control the validity of analysables through your target's is_valid_analysable() method.
* If you want predictions at activity level, use a "by_course" analyser, as otherwise you may have scalability problems. Imagine storing in memory the calculations for each grade_grades record in your site; processing elements by course helps because memory is cleaned after each course is processed.

This decision is important because:
* Time splitting methods are applied to analysable time start and time end (e.g. '''Quarters''' can split the duration of a course in 4 parts)
* Prediction results are grouped by analysable in the admin interface to list predictions.
* By default, users with the '''moodle/analytics:listinsights''' capability are notified of insights at analysable level (not so important, just a default behaviour you can override in your target)

==== Create the target ====

Targets must extend '''\core_analytics\local\target\base''' or its main child class '''\core_analytics\local\target\binary'''. Although Moodle core includes '''\core_analytics\local\target\discrete''' and '''\core_analytics\local\target\linear''', the machine learning backends in Moodle 3.4 only support binary classification. So, unless you are using your own machine learning backend, you need to extend '''\core_analytics\local\target\binary'''.

''To be completed...''

=== Indicators ===

You already know the analyser your target needs (the analyser is what provides samples) and, more or less, what time splitting method may be better for you. You can now select a list of indicators that you think will lead to accurate predictions. You may need to add your own indicators.

You can use the ''Analytics models'' tool to see the list of indicators and edit your model.

== API usage examples ==

This is a list of prediction models and how they could be coded using the Analytics API:

Students at risk of dropping out (based on student's activity, included in Moodle 3.4)
* '''Time splitting:''' quarters, quarters accumulative, deciles, deciles accumulative...
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses
** For training = finished courses with activity
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, predictions should be based on finished courses data
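The ''is_valid_analysable'' rules of the model above (finished courses with activity for training, ongoing courses for prediction) can be sketched as a small decision function. This is a Python illustration of the logic only, not the Moodle PHP API: the function name, its parameters and the ''has_activity'' flag are hypothetical.

```python
def course_is_valid(start: int, end: int, now: int,
                    has_activity: bool, for_training: bool) -> bool:
    """Hypothetical validity check for a course used as analysable.

    Times are Unix timestamps. Training only accepts finished courses with
    activity; prediction only accepts courses that are still ongoing.
    """
    if for_training:
        return end < now and has_activity
    return start <= now <= end
```

A finished course with no activity is rejected for training, and the same finished course is also rejected for prediction because it is no longer ongoing.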

Not engaging course contents (based on the course contents)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = the course is close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, a simple look at the course activities should be enough (e.g. are the course activities engaging?)

No teaching (courses close to the start date without a teacher assigned, included in Moodle 3.4)
* '''Time splitting:''' single range
* '''Analyser samples:''' courses (the analysable element is the site); it would also work using the course as analysable
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = course close to the start date
** For training = no training
* '''Based on assumptions (static)''': yes, just check if there are teachers

Spam users (based on suspicious activity)
* '''Time splitting:''' 2 parts, one after 4h since user creation and another one after 2 days (just an example)
* '''Analyser samples:''' users (analysable elements are users)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = 4h or 2 days passed since the user was created
** For training = 2 days passed since the user was created (spammer flag somewhere recorded to calculate target = 1, otherwise no spammer)
* '''Based on assumptions (static)''': no, predictions should be based on users activity logs, although this could also be done as a static model

Students having a bad time
* '''Time splitting:''' quarters accumulative or deciles accumulative
* '''Analyser samples:''' student enrolments (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing courses with activity
** For training = finished courses
* '''target::is_valid_sample''' = true
* '''Based on assumptions (static)''': no, ideally it should be based on previous cases

Course not engaging for students (checking student's activity)
* '''Time splitting:''' quarters, quarters accumulative, deciles...
* '''Analyser samples:''' course (analysable elements are courses)
* '''target::is_valid_analysable''' = true
* '''target::is_valid_sample'''
** For prediction = ongoing course
** For training = finished courses
* '''Based on assumptions (static)''': no, it should be based on activity logs

Student will fail a quiz (based on things like other students' quiz attempts, the student in other activities, number of attempts to pass the quiz...)
* '''Time splitting:''' single range
* '''Analyser samples:''' student grade on an activity, aka grade_grades id (analysable elements are courses)
* '''target::is_valid_analysable'''
** For prediction = ongoing course with quizzes
** For training = ongoing course with quizzes
* '''target::is_valid_sample'''
** For prediction = more than X% of the course students have attempted the quiz and the sample (a student) has not attempted it yet
** For training = the student has attempted the quiz at least X times (if the student has not passed after X attempts it counts as failed) or the student has passed the quiz
* '''Based on assumptions (static)''': no, it is based on previous students records

== Clustered environments ==

The 'analytics/outputdir' setting can be used by Moodle sites with multiple frontend nodes to specify a directory shared across nodes. Machine learning backends can use this directory to store trained algorithms (their internal variables, weights and similar data) and use them later to get predictions. The Moodle cron lock will prevent multiple executions of the analytics tasks that train machine learning algorithms and get predictions from them.

Analytics models

2017-10-11T10:05:12Z

Dmonllao: /* Settings */

This page describes the Analytics models tool used to visualise, manage and evaluate prediction models.

== Settings ==

You can access ''Analytics settings'' from ''Site administration > Analytics > Analytics settings''.

=== Predictions processor ===

Prediction processors are the machine learning backends that process the datasets generated from the calculated indicators and targets and return predictions. This plugin is shipped with 2 prediction processors:

* The PHP one is the default
* The Python one is more powerful and generates graphs of the model performance, but it requires extra setup: Python itself (https://wiki.python.org/moin/BeginnersGuide/Download) and the moodlemlbackend package.

<code>pip install moodlemlbackend</code>

=== Time splitting methods ===

The time splitting method divides the course duration into parts; the predictions engine will run at the end of each part. It is recommended that you only enable the time splitting methods you are interested in using, because the site contents analyser will calculate all indicators using each of the enabled time splitting methods. The more time splitting methods are enabled, the slower the evaluation process will be.

== Models management ==

You can access the tool from ''Site Administration > Analytics > Analytics models'' to see the list of prediction models.

[[File:prediction-models-list.jpeg]]

These are some of the actions you can perform on a model:

* ''Edit:'' You can edit the models by modifying the list of indicators or the time-splitting method. All previous predictions will be deleted when a model is modified. Models based on assumptions (static models) can not be edited.
* ''Enable / Disable:'' The scheduled task that trains machine learning algorithms with the new data available on the system and gets predictions for ongoing courses skips disabled models. Previous predictions generated by disabled models are not available until the model is enabled again.
* ''Evaluate:'' Evaluate the prediction model by getting all the training data available on the site, calculating all the indicators and the target, and passing the resulting dataset to the machine learning backend, which will split the dataset into training data and testing data and calculate the model's accuracy. Note that the evaluation process uses all the information available on the site, even if it is very old. The accuracy returned by the evaluation process will generally be lower than the real model accuracy, because indicators are calculated more reliably straight after the training data becomes available, as the site state changes over time. The metric used as accuracy is the ''Matthews correlation coefficient'' (a good metric for binary classification).
* ''Log:'' View the log of previous evaluations, including the model accuracy as well as other technical information generated by the machine learning backends, like ROC curves, learning curve graphs, the tensorboard log dir or the model's Matthews correlation coefficient. The information available will depend on the machine learning backend in use.
* ''Get predictions:'' Train machine learning algorithms with the new data available on the system and get predictions for ongoing courses. ''Predictions are not limited to ongoing courses, this depends on the model.''
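The Matthews correlation coefficient mentioned in the ''Evaluate'' and ''Log'' actions can be computed from the confusion matrix of a binary classifier. The following Python sketch only illustrates the formula; it is not the code the core backends use, and the function name is hypothetical.

```python
from math import sqrt


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion matrix counts.

    Returns a value between -1 (total disagreement) and 1 (perfect
    prediction); 0 is returned when the denominator is zero.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
no_better_than_chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```

Unlike plain accuracy, the coefficient stays close to 0 for a model that always predicts the majority class, which is why it suits unbalanced binary datasets.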

[[File:model-evaluation.jpeg]]

== Predictions and Insights ==

Models will start generating predictions at different points in time, depending on the site prediction models and the site courses' start and end dates.

Each model defines which predictions will generate insights and which predictions will be ignored. For example, in the ''Students at risk of dropping out'' prediction model, no insight is generated if a student is predicted as not at risk, because what is interesting is to know which students are at risk of dropping out of courses, not which students are not at risk.

[[File:prediction-model-insights.jpeg]]

Analytics models

2017-10-11T10:04:57Z

Dmonllao: /* Models management */

This page describes the Analytics models tool used to visualise, manage and evaluate prediction models.

== Settings ==

You can access ''Analytics settings'' from ''Site administration > Appearance > Analytics settings''.

=== Predictions processor ===

Prediction processors are the machine learning backends that process the datasets generated from the calculated indicators and targets and return predictions. This plugin is shipped with 2 prediction processors:

* The PHP one is the default
* The Python one is more powerful and generates graphs of the model performance, but it requires extra setup: Python itself (https://wiki.python.org/moin/BeginnersGuide/Download) and the moodlemlbackend package.

<code>pip install moodlemlbackend</code>

=== Time splitting methods ===

The time splitting method divides the course duration into parts; the predictions engine will run at the end of each part. It is recommended that you only enable the time splitting methods you are interested in using, because the site contents analyser will calculate all indicators using each of the enabled time splitting methods. The more time splitting methods are enabled, the slower the evaluation process will be.

== Models management ==

You can access the tool from ''Site Administration > Analytics > Analytics models'' to see the list of prediction models.

[[File:prediction-models-list.jpeg]]

These are some of the actions you can perform on a model:

* ''Edit:'' You can edit the models by modifying the list of indicators or the time-splitting method. All previous predictions will be deleted when a model is modified. Models based on assumptions (static models) can not be edited.
* ''Enable / Disable:'' The scheduled task that trains machine learning algorithms with the new data available on the system and gets predictions for ongoing courses skips disabled models. Previous predictions generated by disabled models are not available until the model is enabled again.
* ''Evaluate:'' Evaluate the prediction model by getting all the training data available on the site, calculating all the indicators and the target, and passing the resulting dataset to the machine learning backend, which will split the dataset into training data and testing data and calculate the model's accuracy. Note that the evaluation process uses all the information available on the site, even if it is very old. The accuracy returned by the evaluation process will generally be lower than the real model accuracy, because indicators are calculated more reliably straight after the training data becomes available, as the site state changes over time. The metric used as accuracy is the ''Matthews correlation coefficient'' (a good metric for binary classification).
* ''Log:'' View the log of previous evaluations, including the model accuracy as well as other technical information generated by the machine learning backends, like ROC curves, learning curve graphs, the tensorboard log dir or the model's Matthews correlation coefficient. The information available will depend on the machine learning backend in use.
* ''Get predictions:'' Train machine learning algorithms with the new data available on the system and get predictions for ongoing courses. ''Predictions are not limited to ongoing courses, this depends on the model.''

[[File:model-evaluation.jpeg]]

== Predictions and Insights ==

Models will start generating predictions at different points in time, depending on the site prediction models and the site courses' start and end dates.

Each model defines which predictions will generate insights and which predictions will be ignored. For example, in the ''Students at risk of dropping out'' prediction model, no insight is generated if a student is predicted as not at risk, because what is interesting is to know which students are at risk of dropping out of courses, not which students are not at risk.

[[File:prediction-model-insights.jpeg]]