Preg question type

Preg question type is a question type using regular expression pattern matching to find if studen response is correct. It is use Perl-compatible regular expressions dialect. For detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:

idea, design, question type code, behaviours code, regex parsing and error reporting - Oleg Sychev;
regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
NFA regular expression matching engine - Valeriy Streltsov

We would gladly accept testers and contributors (see development plans section) - there is still more to be done than we have time. Thanks for Joseph Rezeau for been devoted tester of Preg question type releases and original authors of many ideas, that was implemented in preg question type.

Notations

Starting from Preg 2.1, notation feature allows you to choose notation in which regular expressions for answers will be written. Regular expression is a default one.

One exciting part of it is that you could use preg question type just as improved shortanswer, having access to the hinting facility without any need to understand regular expressions at all! Just choose Moodle shortanswer notation and you could just copy answers from you shortanswer questions. '*' wildcard is supported. Choosing NFA or DFA engine you could get access to the hinting. You could omit all that is said on regular expression topic there, but be sure to read hinting section below to understand various settings you could alter to configure you question hniting behaviour.

Understanding expressions

The regular expressions - as any expressions - are just a bunch of operators with their operands. Don't worry - you all learned to master arithmetic expressions from chilhood and regular ones are just as easy - if you look on them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Don't find that angle, and regular expressions could forever remain vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: x+y*2. There are two operators there: '+' and '*'. The operands of '*' is 'y' and '2'. The operands of '+' is 'x' and result of 'y*2'. Easy?

Thinking about that expression deeper we could found, that there is a definite order of evaluation there, governed by operator's precedence. '*' has a precedence over '+', so it is evaluated first. You could change order of evaluation using brackets: (x+y)*2 will evaluate '+' first and multiply it's results on the 2. Still easy?

One more thing we should learn about operators: their arity, which is just the number of operands required. In example above '+' and '*' are binary operators - they both take two operands. Most arithmetic operators are binary, but minus has unary (single operand) form, like in this equation: y=-x. Note that unary and binary minuses work differently.

Now any epxression are just lego game, where you set a sequence of operators with correct number of operands for each (arity), taking heed of their order of evaluation using their precedence and brackets. Arithmetics expressions are for evaluating numbers. Regular expressions are for finding pattern matches in the strings, so they naturally use another operands and operators - but they are governed by the same rules of precedence and arity.

Regular expressions

The goal of a regular expressions is a pattern matching in the strings. So their operands are characters or characters set. A is a regular expressions too and it matches with single character 'A'. There are several way to define a character set, described below. Special characters, used to write operators,must be escaped when used as operands - preceded by backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but setting pattern for matching you should be able to use any character as operand.

Still, pattern that match only with one character isn't very useful. So there comes operators that allows us to define expression matching with string of a several characters.

Operands

You could use those operands in you expressions:

simple characters match with themselves
escaped special characters if you need to use character with special meaning (like |, * or bracket) just as usual character to match you should preceed it by backslash: a\* matches with a* (while a* matches with a zero or more times), backslash is a special character too and should be escaped \\ matches with \
character classes you could specify a number of possible characters in one place in square brackets:
- [ab,!] matches with a or b or , or !
- ranges: [a-szC-F0-9] you could specify ranges for letters and digits in character classes, mixing them with single characters
- negative character classes starts with ^ [^ab] means any characters except a and b
- escaping inside character classes: [\-\]\\] match with - or ] or \, other characters lost their special meaning inside character class and shoudn't be escaped, but if you want to include ^ in the character class it should not be first
dot meta-character . match with any possible character (except newline, but student coudn't enter it anywhere), you should escape dot \. if you need to match single dot.

Operators

Most common regular expression operators used (could anyone help expand descriptions and examples please?):

concatenation - so simple binary operator that is doesn't have any character at all. Still it is an operator and has it's precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
- ab matches with ab
- a[0-9] matches with a followed by any digit
alternative - binary operator that lets you define a set of alternatives:
- a|b mean a or b
- ab|cd|de mean ab or cd or de
- empty alternative: ab|cd| mean ab or cd or emptiness (useful as a part more complex expressions)
- (aa|bb)c mean aac or bbc - use brackets to outline alternative set
- (aa|bb|)c mean aac or bbc or c - typical use of emptiness
quantifiers - unary operator that lets you define repetition of a character (or regular expression) used as it's operands:
- x* mean x zero or more times
- x+ mean x one or more times
- x? mean x zero or one times
- x{2,4} mean x from 2 to 4 times
- x{2,} mean x two or more times
- x{,2} mean x from 0 to 2 times
- x{2} mean x exactly 2 times
- (ab)* mean ab zero or more times, i.e. if you want to use quantifier on more than one character, you should use brackets
- (a|b){2} mean aa or ab or ba or bb, i.e. it is repeated alternative, not selection one alternative and repeating it

Precedence and order of evaluation

Quantifier has precedence over concatenation and concatenation has precedence over alternative. Let's look what it means:

quantifier over concatenation means quantifiers are executed first and without brackets would repeat only single character:
- ab* matches with a followed with zero or more b's
- changing this using brackets allows us define a string repetition: (ab)* matches with ab zero or more times
concatenation over alternative means you could define multi-character alternatives without brackets (for single character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
- ab|cd|de matches with ab or cd or de
- (aa|bb)c matches with aac or bbc - use brackets to outline alternative set
- (aa|bb|)c matches with aac or bbc or c - typical use of an empty alternative
quantifier over alternative means you should use brackets to repeat an alternative set (not the last character in it):
- ab|cd* matches with ab or c followed with zero or more d's
- (ab|cd)* matches with ab or cd, repeated zero or more time in any order, like ababcdabcdcd etc
- note that quantifiers repeats alternative, not the definite selection from it, i.e.:
  1. (a|b){2} matches with aa or ab or ba or bb, not just aa or bb
  2. use a{2}|b{2} to match aa or bb only

Assertions

Assertions are assertions about some part of the string that doesn't actually goes into matching text, but affects whether matching occur or not.

positive lookahead assert a+(?=b) matches with any number of a ending with b without including b in the match
negative lookahead assert a+(?!b) matches with any number of a that is not followed by b
positive lookbehind assert (?<=b)a+ matches with any number of a preceeded by b
negative lookbehind assert (?<!b)a+ matches with any number of a that is not preceeded by b

Matching

Matching means finding a part of the student answer (or a whole answer) that suited the regular expression. This part called a match.

You should enter regular expressions as answers to the question without modifiers or enclosing characters (modifiers would be added for you by question - u added always and i in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regular expression) to be shown to the student as correct answer. The question would get use all regular expressions in order to find first full match (full for expression, but not necessary all response - see anchoring) and give a grade from it. If there is no full match and engine supports partial matching (see hinting) than partial match that is shortest to complete would be choosen (for displaying a hint, zero grade is given) - or the longest one, if engine coudn't tell which one would be shortest to complete.

Anchoring

Anchoring sets restrictions on the matching process:

if a regular expression starts with ^ the match should start at the start of the student's response;
if a regular exhression ends with $ the match should ends at the end of the student's reponse;
otherwise regular expression match could be contained anywhere inside student response.

If you set exact matching options to yes (default setting), the question would add ^ and $ in each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback while using manually anchored expression for grading.

Hinting

Some matching engines could support hinting (not easy thing to do on the PHP at all) in adaptive mode.

Hinting starts with partial matching. When a student enters a partially correct answer, partial matching could find that response starts matching and on some character broke it. Consider you enter expression:

 are blue, white(,| and) red

and student answered

 they are blue, vhite and red

Partial matching will find that partial match is

 are blue,

Remember, the regular expresion in unanchored so the match shouldn't start with the start of the student response. While using just partial matching the student will be shown correct and incorrect parts:

 they are blue, vhite and red

When hinting is available, student will have hint button by pressing which he receive a hint with one next correct character, highlighted by background coloring:

 they are blue, wvhite and red

You should typically set hint penalty more than usual question penalty, because they are applied separately: usual penalty for an attempt without hinting, while hint penalty for an attempt with hinting. Preg question doesn't add hint character to the student's response (like regex question do it), showing it separately instead for a number of reasons:

it is student's responsibility whether he want to add hinted character to the his response (and some more possibly);
it slightly facilitates thinking about hint, since when the response is modified it is too easy to repeatedly press hint, which is not a desirable behavour usually.

When possible (if question engine supports it), hinting choosing a character that leads to shortest path to complete a match. Consider this response to the previous regular expression:

 are blue, white; red

There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.

It is possible that not all regular expressions will give 100% grade. Consider you add an expression for the students with bad memory:

 are white(,| and) red

with 60% grade and feedback about forgetting blue. You may not want hinting to lead student to the response

  are white, red

if he entered

  are white, oh I forgot other colors.

Hint grade border controls this. Only regular expressions with grade greater or equal than hint grade border would be used for partial matching and hinting. If you set hint grade border to 1, only 100% grade regular expression would be used to hinting, if you set it to 0,5 regular expressions with 50%-100% grades would be used and 0%-49% would not. Regular expressions not used for hinting works only when they have a full match in the student response.

Subpattern capturing and feedback

Any pair of round brackets in the regular expressions are considered a subpattern and when doing matching engine (supporting subpatterns) remember (capture) not only whole match, but it's parts corresponding to all subpatterns. Subpatterns can be nested. If subpattern is repeated (i.e. have quantifier), than only last match of all repeats will be captured. If you want to change order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Asserts don't create subpatterns.

Subpatterns are counted from left to right by opening brackets. Precisely 0 is the whole match, 1 is first subpattern etc. You could insert them in the answer's feedback using simple placeholders: {$0} is replaced by the whole match, {$1} by first subpattern value etc. That can improve the quality of you feedback. Placeholders won't work on the general feedback because different answers could have different number of subpatterns.

PHP preg engine support full subpattern capturing. DFA engine coudn't do it, so you could use only {$0} placeholder working with DFA engine.

Let's look at regex defining an decimal number with optional integral part:

[+\-]?([0-9]+)?\.([0-9]+)

It has two subpatterns: first capturing integral part, second - fractional part of the number. You writed feedback:

The number is: {$0} Integral part is {$1} and fractional part is {$2}

Then entering

123.34

the student will see

The number is: 123.34 Integral part is 123 and fractional part is 34

If no integral part is given, {$1} will be replaced by empty string. There is no way (for now) to erase "Integral part is" under that circumstances - the placeholder syntax may become complex and prone to errors.

Error reporting

Native PHP preg extension functions only report if there error in regular expression or not, so PHP preg extension engine couldn't tell you much about what is error .

DFA engine use custom regular expression parser, so it supports advanced error reporting. The are several class of potential errors reported:

unclosed square brackets of character class;
unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
unopened closing parenthesis;
empty parenthesis of any sort (different forms of subpatterns and assertions);
quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
three or more top-level alternatives in the conditional subpattern.

PCRE (and preg functions) treat most of them as non-errors, making many characters meaning context-dependent. For example quantifier {2,4} placed at the start of regular expression lose it's meaning as quantifier and is treated as five-characters sequence instead (that matches with {2,4}). However such syntax is very prone to errors and make writing regular expression harder.

For now I'm vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are stand for or against this decision please write you positions and reasons on the page comments please. It may be best to have two modes, but this literally means two parsers and this is out of current scope of development. There are more pressing issues ahead.

Looking for missing things

Joseph Rezeau REGEXP question type has a special syntax for missing words feature, allowing to define an answer that would work when something is absent in the answer (and give appropriate feedback to the student). So if you want to look out for the missing word necessary in the response, you'll add this answer (WARNING - REGEXP only syntax on the next line):

 --.*\bnecessary\b.*

where \b defines a word boundary, while .* ensures that this word could be anywhere in the response.

There is no need to have such features in the PREG question type, since similar effect could be achieved with negative assertions combined with anchoring the matching start. Equivalent regular expression to look for the missing word necessary would be

 ^(?!.*\bnecessary\b.*)

where

(?!.*\bnecessary\b.*) is a negative lookahead assertion, that allows matching only if there are no word necessary ahead of some point in the string;
^ is an assertion too, that anchoring the match to the start of response (otherwise there would be places in response after the word "necessary", where matching is possible even if the word is present).

In case the description is difficult to you, just surround regexp to be missing with ^(?! and ).

Sadly, no engine except PHP_preg_matcher is supportting complex assertions (i.e. (?! part). We are working on it, but it stil some time away, requiring further complex work.

Matching engines

Matching engines means different program code that do matching (either by different methods or written by different people). There are no single 'best' matching engine - it depends on the features you want to use and regular expressions engine should handle. They have a different degree of stability and offer different features to use.

PHP preg extension

It is based on native PHP preg functions (which is in turn based on the PCRE library). It is supporting 100% perl-compatible regular expression features, been very stable and thoroughly tested. Sadly, PHP functions doesn't support partial matching (while PCRE could), so (unless we storm PHP developers to add support for partial matching) there is no hinting there. However it will support subpattern capturing. Choose it when you need complex regexp features other engines don't support, subpattern capturing or better performace.

Deterministic finite state automata (DFA)

This is a custom PHP code using DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support could still differs from standard (especially for non-latin characters). On the bright side it is support hinting.

Currently supported operands (there would be more):

single characters
escaped special characters
character classes, including ranges and negative classes
escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
octal and hexadecimal character codes preceeded by \o and \x
meta-character . (any character)

Currently supported operators (there would be more):

concatenation
alternative |
quantifiers * + ? {2,3} {2,} {,2} {2}
positive lookahead assertions
changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:

subpattern capturing
backreferences

Non-deterministing finite state automata(NFA)

NFA engine, introduced in 2.1 release, is a custom matcher that basically could do anything DFA matcher could, with addition of subpattern capturing and backreferences.

Now you don't have to choose between hiting and using captured subpatterns in you questions: NFA could do them both!

Development plans

There is no definite shedule or order of development for those features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with specific feature, please contact question type maintainer (Oleg Sychev) using http://moodle.org messaging.

Improved simple assertions support
Support for complex assertions
Hinting not one character, but completion of the whole world
Add automatic generation of shortest possible correct answer in user-readable form
Add a set of authoring tools to make writing regular expressions easier
Develop backtracking matching engine
Develop more help and examples for the people that don't know much about regular expressions.
Improve Unicode support of custom matching engines

Documentation