Project Inspire API
Project state: Development in progress
- 1 Project inspire
- 1.1 Summary
- 1.2 Built-in models
- 1.3 Concepts
- 1.4 Design
- 1.5 Extensibility
This is the development documentation page for Project Inspire, a work-in-progress project to add predictive analytics to Moodle.
It describes the proposed project for Moodle developers, although the first sections and some of the concepts should be understandable to most people.
Site admins will be able to define models that combine indicators and a target. The target is what we want to predict; the indicators are what we think will lead to an accurate prediction. Moodle will be able to evaluate these models and, if the model accuracy is good enough, Moodle will internally train a machine learning algorithm by calculating the defined indicators with the site data. Once new data that matches the criteria defined by the model is available, Moodle will start predicting what is most likely to happen. Targets will be free to define what actions are performed for each prediction, from sending messages or feeding reports to building new adaptive learning activities.
An obvious example of a model you may be interested in is the prevention of students at risk of dropping out: lack of participation or bad grades in previous activities could be indicators, and the target could be whether the student is able to complete the course or not. Moodle will calculate these indicators and the target for each student in finished courses and predict which students are at risk of dropping out in ongoing courses.
People use Moodle in very different ways, and even courses on the same site can vary significantly from each other. Moodle should only be shipped with models that have been proven to be good at predicting in a wide range of sites and courses.
To diversify the samples and cover a wider range of cases, the Moodle HQ research team will be collecting anonymised datasets of Moodle sites from collaborating institutions and partners to train the machine learning algorithms with them. The models that Moodle ships with will obviously be better at predicting on these institutions' sites, although other datasets will be used as test data for the machine learning algorithm to ensure that the models are good enough to predict accurately on any Moodle site. If you are interested in collaborating, please contact Elizabeth Dalton at [email@example.com] to get information about the process.
Even though the models shipped with Moodle will already be trained by Moodle HQ, each site will continue training its own machine learning algorithms with its own data, which should lead to better prediction accuracy over time.
Training, defined for people not familiar with machine learning concepts: it is a process we need to run before being able to predict anything. We record what has already happened so we can later predict what is likely to happen under the same circumstances. What we train are machine learning algorithms.
As explained above, a model is a combination of indicators and a target. We will use models for prediction.
The relation between indicators and targets is stored in the tool_inspire_models database table.
The \tool_inspire\model class manages all model-related actions. evaluate(), train() and predict() forward the calculated indicators to the prediction processors. The class delegates all heavy processing to analysers and prediction processors. It also manages model evaluation logs.
The \tool_inspire\model class is not expected to be extended.
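The way a model combines indicators and a target into a dataset can be sketched as follows. This is a minimal, language-neutral illustration written in Python, not the plugin's PHP API: the Model class, build_dataset() and the toy data are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical simplification: an indicator or a target is a plain function
# that maps a sample id to a number.
Calculator = Callable[[int], float]


@dataclass
class Model:
    """A model is a set of indicators plus one target."""
    indicators: Dict[str, Calculator]
    target: Calculator

    def build_dataset(self, sample_ids: List[int]) -> List[List[float]]:
        """Return one row per sample: indicator values followed by the target."""
        rows = []
        for sample_id in sample_ids:
            features = [calc(sample_id) for calc in self.indicators.values()]
            rows.append(features + [self.target(sample_id)])
        return rows


# Toy data (assumed for illustration): forum posts and course completion per student.
posts = {1: 0, 2: 12}
completed = {1: 0.0, 2: 1.0}

model = Model(
    indicators={"has_posts": lambda sid: 1.0 if posts[sid] > 0 else -1.0},
    target=lambda sid: completed[sid],
)
dataset = model.build_dataset([1, 2])
```

In this sketch the resulting rows are what would be handed to a prediction processor for evaluation or training.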
Analyser classes are responsible for creating the dataset files that will be sent to the prediction processors.
The base class \tool_inspire\local\analyser\base does most of the work. It contains a key abstract method, though: get_all_samples(), which defines what a sample is. A sample can be any Moodle entity: a course, a user, an enrolment, a quiz attempt... Samples are nothing by themselves, just a list of ids; they only make sense when combined with the target and indicator classes.
Other responsibilities of analyser classes:
- Define the context of the predictions
- Discard invalid data
- Filter out already trained samples
- Include the time factor (time range processors, explained below)
- Forward calculations to indicators and target classes
- Record all calculations in a file
- Record all analysed sample ids in the database
If you are introducing a new analyser, there is an important non-obvious fact you should know: for scalability reasons, all calculations at course level are executed on a per-course basis and the resulting datasets are merged once the analysis of all site courses is complete. We do it this way because, depending on the size of the site, it could take hours to analyse the whole site; this is a good way to break the process up into pieces. When coding a new analyser you need to decide whether to extend \tool_inspire\local\analyser\by_course (your analyser will process a list of courses) or \tool_inspire\local\analyser\sitewide (your analyser will receive just one analysable element, the site).
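The per-course processing described above can be pictured with a short sketch. It is written in Python purely for illustration and keeps everything in memory, whereas the text describes dataset files; analyse_course(), analyse_by_course() and the enrolment data are hypothetical and do not reflect the plugin's PHP classes.

```python
from typing import Dict, List

Row = List[float]


def analyse_course(course_id: int, enrolments: Dict[int, List[int]]) -> List[Row]:
    """Hypothetical per-course step: one row per student enrolled in the course.

    A real analyser would calculate indicators and the target here; this
    placeholder only records which course and user each row belongs to.
    """
    return [[float(course_id), float(user_id)] for user_id in enrolments[course_id]]


def analyse_by_course(course_ids: List[int], enrolments: Dict[int, List[int]]) -> List[Row]:
    """Analyse each course separately, then merge the partial datasets."""
    merged: List[Row] = []
    for course_id in course_ids:
        # Each course is an independent unit of work, so a long site-wide
        # analysis can be broken up into pieces.
        merged.extend(analyse_course(course_id, enrolments))
    return merged


# Toy enrolment data (assumed): course id -> enrolled user ids.
site_enrolments = {10: [1, 2], 20: [3]}
site_dataset = analyse_by_course([10, 20], site_enrolments)
```

A sitewide analyser would, by contrast, run a single analysis step over one analysable element, the site.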
Target classes define what we want to predict and calculate it across the site. They also define the actions to perform depending on the received predictions.
Targets depend on analysers because analysers provide them with the samples they need. Analysers are separate entities from targets because they can be reused across targets. Each target needs to specify which analyser it uses. A few examples, in case the difference between analysers, samples and targets is not clear:
- Target: 'students at risk of dropping out'. Analyser provides: 'course student'
- Target: 'spammer'. Analyser provides: 'site users'
- Target: 'ineffective course'. Analyser provides: 'courses'
- Target: 'difficulties to pass a specific quiz'. Analyser provides: 'quiz attempts in a specific quiz'
A callback defined by the target will be executed once new predictions start coming in, so each target has control over the prediction results.
Only binary classifications will be supported initially, although the API will be extended in the future to support discrete values (multiclass classification) and continuous values (linear regression).
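The idea of a target-defined callback reacting to binary predictions can be sketched like this. The example is in Python for illustration only; DropoutRiskTarget, prediction_callback() and dispatch() are hypothetical names, and the encoding 1 = at risk, 0 = not at risk is an assumption.

```python
from typing import Iterable, List, Tuple


class DropoutRiskTarget:
    """Hypothetical binary target: 1 = at risk of dropping out, 0 = not at risk."""

    def __init__(self) -> None:
        self.flagged: List[int] = []

    def prediction_callback(self, sample_id: int, prediction: int) -> None:
        # The target decides what happens with each prediction. Here it only
        # remembers which samples were predicted to be at risk.
        if prediction == 1:
            self.flagged.append(sample_id)


def dispatch(target: DropoutRiskTarget, predictions: Iterable[Tuple[int, int]]) -> None:
    """Hand every (sample id, prediction) pair to the target's callback."""
    for sample_id, prediction in predictions:
        target.prediction_callback(sample_id, prediction)


target = DropoutRiskTarget()
dispatch(target, [(101, 1), (102, 0), (103, 1)])
```

A real target could replace the list append with any action, such as queuing a message or feeding a report.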
Another aspect controlled by targets is the generation of insights. Analyser samples always have a context; the context level (activity module, course, user...) depends on the sample. This context is used to notify users with the tool/inspire:listinsights capability (the teacher role by default) that new insights are available. These users will receive a notification with a link to the predictions page, where all predictions for that context are listed.
Each prediction will have a set of actions available. In cases like Students at risk of dropping out, the actions can be to send a message to the student, to view their course activity report...
Indicators are also defined as PHP classes. Their responsibility is quite simple: to calculate the indicator using the provided sample.
Indicators are not limited to a single analyser like targets are, which makes indicators easier to reuse across models. Indicators specify the minimum set of data they need to perform the calculation. The indicator developer should also make an effort to imagine how the indicator will work when different analysers are used. For example, an indicator named Posts in any forum could be initially coded for a Shy students in a course target. This target would use the course enrolments analyser, so the indicator developer knows that a course and a user will be provided by that analyser, but the indicator can easily be coded so that it can be reused by other analysers, like courses or users. In this case the developer can choose to require a course or a user, and the name of the indicator would change accordingly: User posts in any forum, which could be used in models like Inactive users, or Posts in any of the course forums, which could be used in models like Low participation courses.
The calculated value can go from -1 (minimum) to 1 (maximum). This guarantees that we will not have indicators like absolute number of write actions because we will be forced to limit the calculation to a range, e.g. -1 = 0 actions, -0.33 = some basic activity, 0.33 = activity, 1 = plenty of activity.
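As an illustration of limiting a calculation to the -1 to 1 range, the following Python sketch maps an absolute number of write actions onto the four example values mentioned above. The function name and the thresholds (5 and 20 actions) are assumptions chosen for this example, not values taken from the project.

```python
def write_actions_indicator(n_actions: int) -> float:
    """Map an absolute count of write actions to a value in [-1, 1].

    Thresholds are hypothetical; a real indicator would choose its own.
    """
    if n_actions <= 0:
        return -1.0   # no actions at all
    if n_actions < 5:
        return -0.33  # some basic activity
    if n_actions < 20:
        return 0.33   # activity
    return 1.0        # plenty of activity


values = [write_actions_indicator(n) for n in (0, 3, 12, 50)]
```

Whatever the raw count is, the value passed on to the machine learning backend stays inside the same fixed range.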
Time splitting methods
In some cases the time factor is not important and we just want to classify a sample. That is fine, but things get more complicated when we want to predict what will happen in the future. E.g. predictions about students at risk of dropping out are not useful once the course is over or when it is too late for any intervention.
Calculating in time ranges is the most challenging aspect this project faces. Indicators need to be designed with this in mind, and we need to add time-dependent indicators to the calculated indicators so that machine learning algorithms are smart enough to avoid mixing calculations belonging to the beginning of the course with calculations belonging to the end of the course.
There are many different ways to split up a course into time ranges: in weeks, quarters, 8 parts, ranges with longer periods at the beginning and shorter periods at the end... The ranges can be accumulative (from the beginning of the course) or only from the start of the time range.
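The difference between accumulative and non-accumulative ranges can be shown with a small Python sketch that splits a course into equal parts. split_ranges() and its parameters are hypothetical names for this illustration; timestamps are plain integers to keep the example short.

```python
from typing import List, Tuple


def split_ranges(start: int, end: int, parts: int,
                 accumulative: bool = False) -> List[Tuple[int, int]]:
    """Split [start, end) into `parts` equal time ranges.

    With accumulative=True every range begins at the course start;
    otherwise each range begins where the previous one ended.
    """
    if parts <= 0 or end <= start:
        raise ValueError("need a positive number of parts and end > start")
    length = end - start
    ranges = []
    for i in range(parts):
        range_start = start if accumulative else start + (length * i) // parts
        range_end = start + (length * (i + 1)) // parts
        ranges.append((range_start, range_end))
    return ranges


quarters = split_ranges(0, 100, 4)
accumulated_quarters = split_ranges(0, 100, 4, accumulative=True)
```

Here quarters holds four consecutive ranges, while accumulated_quarters holds four ranges that all start at the beginning of the course and grow in length.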
Which time splitting method will be better at predicting depends on the model. The model evaluation process iterates through all enabled time splitting methods and returns the prediction accuracy of each of them.
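A rough picture of that evaluation loop, written in Python for illustration: for each time splitting method, compare predicted and actual target values and report the share of correct predictions. The function names, the method names and the toy (predicted, actual) pairs are assumptions, and plain accuracy is used here only as a stand-in for whatever score the real evaluation reports.

```python
from typing import Dict, List, Tuple

Pair = Tuple[int, int]  # (predicted value, actual value)


def accuracy(pairs: List[Pair]) -> float:
    """Fraction of pairs where the prediction matches the actual value."""
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / len(pairs)


def evaluate_methods(results_by_method: Dict[str, List[Pair]]) -> Dict[str, float]:
    """Return one accuracy score per time splitting method."""
    return {name: accuracy(pairs) for name, pairs in results_by_method.items()}


# Toy evaluation results (assumed), keyed by hypothetical method names.
toy_results = {
    "quarters": [(1, 1), (0, 0), (1, 0), (0, 0)],
    "weekly": [(1, 1), (0, 1)],
}
scores = evaluate_methods(toy_results)
best_method = max(scores, key=scores.get)
```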
Prediction processors are the machine learning backends that process the datasets generated from the calculated indicators and targets. They are a tool_inspire plugin subtype, which makes this a pluggable system with a common interface:
- Evaluate a provided model
- Train a machine learning algorithm with the existing site data
- Predict targets based on previously trained algorithms.
Prediction processors and Moodle communicate through files because these processors can be written in PHP, in Python, in other languages, or can even use cloud services. This needs to be scalable, so processors are expected to be able to manage big files and to train algorithms reading input files in batches if necessary.
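Reading an input file in batches, as mentioned above, could look like the following Python sketch. The text only says that files are used; the CSV layout, the read_batches() helper and the sample rows are assumptions made for this example.

```python
import csv
import io
from typing import Iterator, List, TextIO


def read_batches(handle: TextIO, batch_size: int) -> Iterator[List[List[str]]]:
    """Yield rows from a CSV file in lists of at most `batch_size` rows."""
    batch: List[List[str]] = []
    for row in csv.reader(handle):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Do not lose the last, possibly shorter, batch.
        yield batch


# An in-memory stand-in for a dataset file: indicator values, then the target.
sample_file = io.StringIO("0.33,-1,1\n1,0.33,0\n-1,-1,1\n")
batches = list(read_batches(sample_file, batch_size=2))
```

Because the function is a generator, only one batch is held in memory at a time, which is what allows a processor to work through files larger than the available memory.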
The system will be initially designed as a Moodle admin tool plugin, aiming to move most of the code to a core API when we are ready to include the plugin in core.
Moodle plugins will be able to add and/or redefine any of the entities involved in the whole data modelling process.
Some of the base classes to extend or follow as an example:
Read more in #Extensibility.
Interface to be implemented by prediction processors. Pretty basic interface, just methods to train, predict and evaluate datasets.
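To give an idea of how small such an interface can be, here is a Python sketch with the three operations named above and a deliberately naive implementation that always predicts the most common target value. Only the method names train, predict and evaluate come from the text; the signatures, the in-memory rows (the text describes file-based communication) and MajorityClassProcessor are assumptions.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import List, Optional

Row = List[float]  # indicator values, followed by the target when labelled


class PredictionProcessor(ABC):
    @abstractmethod
    def train(self, rows: List[Row]) -> None: ...

    @abstractmethod
    def predict(self, rows: List[Row]) -> List[float]: ...

    @abstractmethod
    def evaluate(self, rows: List[Row]) -> float: ...


class MajorityClassProcessor(PredictionProcessor):
    """Naive backend: always predicts the most frequent target seen in training."""

    def __init__(self) -> None:
        self.majority: Optional[float] = None

    def train(self, rows: List[Row]) -> None:
        labels = [row[-1] for row in rows]
        self.majority = Counter(labels).most_common(1)[0][0]

    def predict(self, rows: List[Row]) -> List[float]:
        if self.majority is None:
            raise RuntimeError("train() must be called before predict()")
        return [self.majority for _ in rows]

    def evaluate(self, rows: List[Row]) -> float:
        # Train a fresh instance on the first half, test on the second half.
        # Assumes at least two labelled rows.
        half = len(rows) // 2
        probe = MajorityClassProcessor()
        probe.train(rows[:half])
        actual = [row[-1] for row in rows[half:]]
        predicted = probe.predict(rows[half:])
        hits = sum(1 for p, a in zip(predicted, actual) if p == a)
        return hits / len(actual)


labelled = [[0.33, 1.0], [-1.0, 1.0], [1.0, 0.0], [0.33, 1.0]]
processor = MajorityClassProcessor()
evaluation_score = processor.evaluate(labelled)
processor.train(labelled)
predictions = processor.predict([[0.5], [-0.2]])
```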
Analysable items are analysed by analysers :) In most cases an analysable will be a course, although it can also be the site or any other element.
We will include two analysables, \tool_inspire\course and \tool_inspire\site; they should be enough for most of the things you might want to analyse. They need to provide an id, a \context and, if you expect them to be used in models that require time splitting, get_start() and get_end() methods. Read the related comments above in #Analyser.
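The requirements listed above (an id, a context, and start and end times for time splitting) can be sketched in Python as follows. CourseAnalysable, the string used as a stand-in for a context, and supports_time_splitting() are hypothetical; only get_start() and get_end() are names taken from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CourseAnalysable:
    """Hypothetical analysable element wrapping a course."""
    id: int
    context: str            # stand-in for a real context object
    start: Optional[int]    # timestamp, or None when unknown
    end: Optional[int]

    def get_start(self) -> Optional[int]:
        return self.start

    def get_end(self) -> Optional[int]:
        return self.end


def supports_time_splitting(analysable: CourseAnalysable) -> bool:
    """Time splitting needs a known start and an end that comes after it."""
    start, end = analysable.get_start(), analysable.get_end()
    return start is not None and end is not None and end > start


finished_course = CourseAnalysable(id=10, context="course:10", start=1000, end=5000)
open_ended_course = CourseAnalysable(id=11, context="course:11", start=1000, end=None)
```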
A possible extension point of \tool_inspire\sitewide would be to redefine get_start() and get_end().
Indicators and targets must implement this interface.
It is already implemented by \tool_inspire\local\indicator\base and \tool_inspire\local\target\base but you can still code targets or indicators from scratch if you need more control.
This project aims to be as extensible as possible. Any Moodle component, including third-party plugins, will be able to define indicators, targets, analysers and time splitting processors.
An example of a possible extension would be a plugin with indicators that fetch student academic records from the university's student information system (SIS); the site admin could build a new model on top of the built-in 'students at risk of drop out detection' model, adding the SIS indicators to improve the model accuracy or for research purposes.