Preg question type

Revision as of 18:21, 24 July 2012

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give great power and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the Usage of the Preg question type section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:

Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.

We would gladly accept testers and contributors (see the development plans section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of the Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

Understanding expressions

Regular expressions - like any expressions - are just a bunch of operators with their operands. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: x+y*2. There are two operators: '+' and '*'. The operands of '*' are 'y' and '2'. The operands of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply we can find that there is a definite order of evaluation, governed by operator precedence. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: (x+y)*2 will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their arity - this is just the number of operands required. In the example above '+' and '*' are binary operators - they both take two operands. Most arithmetic operators are binary, but the minus has a unary (single operand) form, like in this equation: y=-x. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of operators with the correct number of operands for each (arity), controlling their evaluation order with precedence and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

Regular expressions

Regular expressions are a powerful mechanism for searching in strings using patterns, so their operands are characters or character sets. For example, A is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be escaped when used as operands - preceded by a backslash. These special characters are:

\ ^ $ . [ ] | ( ) ? * + { }

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use any character as an operand.

Operands

Here's an incomplete list of operands that define character sets.

Simple characters (with no special meaning) match themselves.
Escaped special characters match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\". NOTE: when you are unsure whether to escape some character, it is safe to place "\" before any character except letters and digits. Do not escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
Character sets match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
- "[ab,!]" matches "a", "b", "," and "!";
- "[a-szC-F0-9]" contains ranges (defined by a hyphen between 2 characters) "a-s", "C-F" and "0-9" mixed with the single character "z": it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
- "[^a-z-]" starts with the "^" that means a negative character set: it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
- "[\-\]\\]" contains escaping inside a character set: it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there.
Dot meta-character (".") matches any possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
Escape sequences for common character sets:
- "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
- "\s" for any space character and "\S" for any non-space character;
- "\d" for any digit and "\D" for any non-digit.
Simple assertions - they are not characters, but conditions to test, they don't consume characters while matching, unlike other operands:
- "^" matches in the start of the string, fails otherwise;
- "$" matches in the end of the string, fails otherwise;
- "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
- "\B" matches not on a word boundary, negative to "\b".
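
The operands above can be tried out in any regex implementation. Below is a minimal sketch using Python's standard re module, chosen only for illustration: the Preg question type itself runs on PHP, and Python's dialect differs from PCRE in some details. The names examples, results, non_matches and word_hits are made up for this sketch.

```python
import re

# Each pair is (regex, string that should match in full).
# re.fullmatch requires the whole string to match the pattern.
examples = [
    (r"\|", "|"),            # escaped special character
    (r"a\*b\[", "a*b["),     # several escaped special characters
    (r"\\", "\\"),           # escaped backslash matches one backslash
    (r"[ab,!]", "!"),        # simple character set
    (r"[a-szC-F0-9]", "z"),  # ranges mixed with a single character
    (r"[^a-z-]", "Q"),       # negative character set
    (r"[\-\]\\]", "]"),      # escaping inside a character set
    (r"\w\s\d", "a 5"),      # escape sequences for common character sets
]
results = [re.fullmatch(regex, text) is not None for regex, text in examples]

# "t" lies outside every range of the set; "-" is excluded by the negative set.
non_matches = [
    re.fullmatch(r"[a-szC-F0-9]", "t") is None,
    re.fullmatch(r"[^a-z-]", "-") is None,
]

# \b is an assertion: it consumes no characters, it only checks the position.
word_hits = re.findall(r"\bcat\b", "cat concat cat.")
```

Every entry of results and non_matches is True, and word_hits holds two occurrences of "cat": the one inside "concat" is skipped because there is no word boundary in front of it.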

Still, a pattern that matches only one character isn't very useful. So here come the operators that allow us to define an expression which matches strings of several characters.

Operators

Here's a list of the common regex operators:

Concatenation - a binary operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
- "ab" matches "ab";
- "a[0-9]" matches "a" followed by any digit, for example, "a5";
Alternative - a binary operator that lets you define a set of alternatives:
- "a|b" matches "a" or "b";
- "ab|cd|de" matches "ab" or "cd" or "de";
- "ab|cd|" matches "ab" or "cd" or emptiness (useful as a part in more complex expressions);
- "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
- "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
Quantifiers - unary operators that let you define repetition of whatever is used as their operand:
- "x*" matches "x" zero or more times;
- "x+" matches "x" one or more times;
- "x?" matches "x" zero or one times;
- "x{2,4}" matches "x" from 2 to 4 times;
- "x{2,}" matches "x" two or more times;
- "x{,2}" matches "x" from 0 to 2 times;
- "x{2}" matches "x" exactly 2 times;
- "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
- "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
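
A quick way to check the operator examples is to run them against sample strings. The sketch below uses Python's re module as a stand-in for the question's engines (an assumption made for illustration; the helper full and the dictionary checks are not part of Preg).

```python
import re

def full(regex, text):
    """Return True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

checks = {
    "concatenation": full(r"a[0-9]", "a5"),
    "alternative": full(r"ab|cd|de", "cd"),
    "empty alternative": full(r"(aa|bb|)c", "c"),
    "star allows zero repeats": full(r"x*", ""),
    "plus needs at least one": not full(r"x+", ""),
    "bounded repeat": full(r"x{2,4}", "xxx") and not full(r"x{2,4}", "xxxxx"),
    "repeated group": full(r"(ab)*", "ababab"),
    "repeated alternative": full(r"(a|b){2}", "ba"),
}
```

All values in checks are True; changing a sample string is an easy way to see where each operator stops matching.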

Precedence and order of evaluation

A quantifier has precedence over concatenation and concatenation has precedence over alternative. Let's look at what it means:

quantifiers over concatenation means that quantifiers are executed first and will repeat only a single character if used without parentheses:
- "ab*" matches "a" followed by zero or more "b";
- "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
concatenation over alternative means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
- "ab|cd|de" matches "ab" or "cd" or "de";
- "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
quantifier over alternative means that you should use parentheses to repeat an alternative set:
- "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
- "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
- "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
- "a{2}|b{2}" matches "aa" or "bb" only.
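
The precedence rules can be verified by comparing each regex with and without parentheses. This is a small sketch in Python's re module (used here only as an illustration of the same rules; the variable names are made up).

```python
import re

def full(regex, text):
    """Return True when the whole text matches the regex."""
    return re.fullmatch(regex, text) is not None

# Quantifier over concatenation: without parentheses only "b" is repeated.
star_on_char = (full(r"ab*", "abbb"), full(r"ab*", "abab"))
star_on_group = (full(r"(ab)*", "abab"), full(r"(ab)*", "abbb"))

# Quantifier over alternative: "*" applies to "d" only unless a group is used.
alt_then_star = (full(r"ab|cd*", "cddd"), full(r"ab|cd*", "abab"))
group_alt_star = full(r"(ab|cd)*", "ababcdabcdcd")

# Repeating an alternative versus choosing between two repetitions.
samples = ("aa", "ab", "ba", "bb")
repeat_choice = [s for s in samples if full(r"(a|b){2}", s)]
repeat_each = [s for s in samples if full(r"a{2}|b{2}", s)]
```

Here star_on_char, star_on_group and alt_then_star are each (True, False), group_alt_star is True, repeat_choice contains all four samples, and repeat_each contains only "aa" and "bb".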

Assertions

Assertions about some part of the string don't go into the matched text, but affect whether and where a match occurs:

positive lookahead assertion "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
negative lookahead assertion "a+(?!b)" matches one or more "a" not followed by "b";
positive lookbehind assertion "(?<=b)a+" matches one or more "a" preceded by "b";
negative lookbehind assertion "(?<!b)a+" matches one or more "a" not preceded by "b".
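
The four assertions can be observed with a short experiment. The sketch below uses Python's re module, which supports the same lookaround syntax for these simple cases (an illustration only; the sample strings and variable names are invented).

```python
import re

# Lookahead: the asserted character is checked but not included in the match.
ahead_pos = re.search(r"a+(?=b)", "caaab").group()
ahead_neg = re.search(r"a+(?!b)", "caaac").group()

# With backtracking, the negative lookahead can still succeed before a "b"
# by giving up the last "a": the match is then followed by "a", not by "b".
ahead_neg_backtrack = re.search(r"a+(?!b)", "caaab").group()

# Lookbehind: the condition applies to the character before the match.
behind_pos = re.search(r"(?<=b)a+", "caa baa").span()
behind_neg = re.search(r"(?<!b)a+", "caa baa").span()
```

ahead_pos and ahead_neg are both "aaa", ahead_neg_backtrack is "aa", behind_pos is (5, 7) (the "aa" after "b") and behind_neg is (1, 3) (the "aa" after "c").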

Matching

Matching means finding a part of the student's answer that suits the regular expression. This part is called the match. You should enter regular expressions as answers to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the correct answer. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see anchoring) and give the grade from it. If there is no full match and the engine supports partial matching (see hinting), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

Anchoring

Anchoring is used to set restrictions on the matching process by using simple assertions:

if a regular expression starts with the ^ the match should start at the start of the student's response;
if a regular expression ends with the $ the match should end at the end of the student's response;
otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:

"^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
"^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
"^ab$|^cd$" is another way to get an exact match (all top-level alternatives are anchored).

If you set the exact matching option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
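
The anchoring pitfalls are easy to reproduce. The sketch below uses Python's re module; add_exact_anchors is a hypothetical helper showing one way anchors could be added without creating a new capturing subpattern, and it is an assumption, not the question type's actual code.

```python
import re

def found(regex, text):
    """Return True when the regex matches somewhere in the text."""
    return re.search(regex, text) is not None

# "^" belongs to the first alternative only and "$" to the last one only.
loose = [s for s in ("abxx", "xxcd", "xabx") if found(r"^ab|cd$", s)]

# Two ways to anchor every alternative.
samples = ("ab", "cd", "abxx", "xxcd")
strict = [s for s in samples if found(r"^(ab|cd)$", s)]
per_alt = [s for s in samples if found(r"^ab$|^cd$", s)]

def add_exact_anchors(regex):
    """Hypothetical helper: anchor a whole regex using a non-capturing group,
    so the numbering of existing subpatterns does not change."""
    return "^(?:" + regex + ")$"

exact = [s for s in samples if found(add_exact_anchors("ab|cd"), s)]
```

loose is ["abxx", "xxcd"], while strict, per_alt and exact are all ["ab", "cd"].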

Local case-sensitivity modifiers

Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:

"(?i)" will turn case-sensitivity off;
"(?-i)" will turn case-sensitivity on.

This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:

"abc(de(?i)gh)xyz" will have only the "gh" part case-insensitive;
"abc(de)(?i)ghxyz" will have the "ghxyz" part case-insensitive.

Usage of the Preg question type

Basically, this question type is an extended version of the Shortanswer.

Notations

Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. The default one is Regular expression, which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, with access to hinting and no need to understand regular expressions! Just choose the Moodle shortanswer notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the hinting section to understand the various settings you can alter to configure your question's hinting behaviour.
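
To give an idea of what a shortanswer-style notation amounts to, here is a rough sketch in Python of translating such an answer into a regex: everything is taken literally except '*', which stands for any run of characters. The function shortanswer_to_regex is hypothetical and is not how Preg implements notations; it also does not handle an escaped asterisk.

```python
import re

def shortanswer_to_regex(answer):
    """Hypothetical converter: '*' becomes 'any characters',
    every other character is matched literally."""
    literal_parts = answer.split("*")
    return ".*".join(re.escape(part) for part in literal_parts)

def matches_shortanswer(answer, response, use_case=False):
    """Check a response against a shortanswer-style answer in full."""
    flags = 0 if use_case else re.IGNORECASE
    return re.fullmatch(shortanswer_to_regex(answer), response, flags) is not None

wildcard_ok = matches_shortanswer("The * flag (?)", "the French flag (?)")
literal_only = matches_shortanswer("a.b", "axb")
```

wildcard_ok is True, while literal_only is False because the dot is escaped and no longer means "any character".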

Hinting

Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with partial matching. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered an expression:

 are blue, white(,| and) red

and a student answered:

 they are blue, vhite and red

Partial matching will find that the partial match is

 are blue,

Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:

 they are blue, vhite and red

When hinting is available, the student will have a hint button; pressing it gives a hint with the next correct character, highlighted by background coloring:

 they are blue, wvhite and red

You should typically set the hint penalty higher than the usual question penalty, because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting. The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:

it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press hint repeatedly, which is usually not desirable behaviour.

When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:

 are blue, white; red

There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
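
The idea of choosing the hint character can be illustrated without a real matching engine. The sketch below is a deliberate simplification in Python: instead of a regex it takes the two strings that the example expression can match (expanded by hand) and picks the partial match that needs the fewest extra characters. The function next_hint is hypothetical and says nothing about how the NFA or DFA engines work internally.

```python
def next_hint(response, full_answers):
    """Return (matched_part, hint_char) for the partial match that needs the
    fewest extra characters to complete. Expects a non-empty answer list."""
    best = None  # (characters still missing, matched part, hint character)
    for answer in full_answers:
        # The regex is unanchored, so a match may start anywhere in the response.
        for start in range(len(response) + 1):
            matched = 0
            while (matched < len(answer)
                   and start + matched < len(response)
                   and response[start + matched] == answer[matched]):
                matched += 1
            missing = len(answer) - matched
            hint = answer[matched] if missing else ""
            if best is None or missing < best[0]:
                best = (missing, answer[:matched], hint)
    return best[1], best[2]

# The strings matched by " are blue, white(,| and) red", written out by hand.
answers = [" are blue, white, red", " are blue, white and red"]

first = next_hint("they are blue, vhite and red", answers)
second = next_hint(" are blue, white; red", answers)
```

first is (" are blue, ", "w") and second is (" are blue, white", ","): in the second case the comma wins because it leaves 5 characters to type instead of 8.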

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:

 are white(,| and) red

with 60% grade and feedback about forgetting blue. You may not want hinting to lead student to the response

  are white, red

if he entered

  are white, oh I forgot other colors.

Hint grade border controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
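
In other words, the hint grade border is a simple filter over the answers. A minimal sketch in Python, with a made-up data layout (a list of regex and grade pairs) and a hypothetical function name:

```python
# (regex, grade) pairs; grades are fractions, so 1.0 stands for 100%.
answers = [
    (" are blue, white(,| and) red", 1.0),
    (" are white(,| and) red", 0.6),
]

def answers_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

only_full_grade = answers_for_hinting(answers, 1)
from_half_grade = answers_for_hinting(answers, 0.5)
```

With the border at 1 only the first regex takes part in hinting; with 0.5 both do, so a student could be led towards the 60% answer.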

Subpattern capturing and feedback

Any pair of parentheses in a regex is considered a subpattern, and when matching the engine remembers (captures) not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, 0 is the whole match, 1 is the first subpattern etc. You can insert them in the answer's feedback using simple placeholders: {$0} is replaced by the whole match, {$1} by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the general feedback because different answers can have different numbers of subpatterns.

The PHP preg engine and the NFA engine support full subpattern capturing. The DFA engine can't do it by its nature, so you can use only the {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:

[+\-]?([0-9]+)?\.([0-9]+)

It has two subpatterns: first capturing integral part, second - fractional part of the number. If you wrote the feedback:

The number is: {$0} Integral part is {$1} and fractional part is {$2}

Then a student entered

123.34

He will see

The number is: 123.34 Integral part is 123 and fractional part is 34

If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - supporting that may make the placeholder syntax complex and prone to errors.
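
The placeholder substitution can be imitated in a few lines. This is a sketch in Python, not the question type's code: fill_feedback is a hypothetical function that replaces each {$n} with the text captured by subpattern n, or with an empty string when that subpattern did not take part in the match.

```python
import re

def fill_feedback(feedback, match):
    """Replace {$0}, {$1}, ... with the corresponding captured text."""
    def value(placeholder):
        captured = match.group(int(placeholder.group(1)))
        return captured if captured is not None else ""
    return re.sub(r"\{\$(\d+)\}", value, feedback)

number = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

with_integral = fill_feedback(feedback, re.fullmatch(number, "123.34"))
without_integral = fill_feedback(feedback, re.fullmatch(number, ".5"))
```

with_integral reproduces the feedback shown above, and without_integral leaves a gap after "Integral part is" because {$1} becomes an empty string.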

Error reporting

Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the PHP preg extension engine can't tell you much about the error.

NFA and DFA engines use a custom regular expression parser, so they support advanced error reporting. There are several classes of potential errors:

more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
unopened closing parenthesis "abc)";
unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
empty parentheses of any sort "(?=)";
quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
unclosed brackets of character classes "[a-fA-F\d";
setting and unsetting the same modifier at the same time "(?i-i)";
unknown unicode properties "\p{Squirrel}";
unknown posix classes "[[:hamster:]]";
unknown (*...) sequence "(*QWERTY)";
incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".

PCRE (and preg functions) treat most of them as non-errors, making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.
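
For comparison, here is how a parser that reports errors behaves in practice. The sketch uses Python's re module, whose parser happens to reject several of the patterns listed above; it only illustrates the error-reporting approach and is unrelated to Preg's own parser. The helper describe_error is made up, and the exact message texts depend on the Python version.

```python
import re

def describe_error(regex):
    """Return the parser's error message, or None if the regex compiles."""
    try:
        re.compile(regex)
    except re.error as err:
        return str(err)
    return None

suspicious = [
    "abc)",         # unopened closing parenthesis
    "(?:qwerty",    # unclosed opening parenthesis
    "+",            # quantifier without an operand
    "a(+)",         # the same inside a subexpression
    r"[a-fA-F\d",   # unclosed character class
    "a{5,4}",       # incorrect quantifier range
    "[z-a]",        # incorrect character range
]
reports = {regex: describe_error(regex) for regex in suspicious}
```

Every entry of reports holds a message rather than None, so a teacher would be told about the mistake instead of getting a regex that silently matches something unexpected.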

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

Looking for missing things

Joseph Rezeau's REGEXP question type has a missing words feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with negative assertions combined with anchoring the start of the match. The regular expression to look for the missing word "necessary" would be

 ^(?!.*\bnecessary\b.*)

where

(?!.*\bnecessary\b.*) is a negative lookahead assertion that allows matching only if the word "necessary" does not occur ahead of some point in the string;
^ is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regex that should be missing with ^(?! and ). Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
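
The effect of the anchor can be checked directly. Below is a sketch in Python's re module, which supports this lookahead syntax; is_missing is a hypothetical helper built around the expression above.

```python
import re

def is_missing(word, response):
    """True when the word does not occur anywhere in the response."""
    pattern = r"^(?!.*\b" + re.escape(word) + r"\b.*)"
    return re.search(pattern, response) is not None

absent = is_missing("necessary", "this step is needed")
present = is_missing("necessary", "this step is necessary here")
part_of_word = is_missing("necessary", "this step is unnecessary")

# Without "^" the assertion succeeds at a position after the word has started,
# so the regex matches even though the word is present.
unanchored = re.search(r"(?!.*\bnecessary\b.*)", "this step is necessary here")
```

absent is True, present is False, and part_of_word is True because \b requires "necessary" to stand as a separate word. unanchored is a match object, which shows why the ^ is required.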

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the (?! part). We are working on it, but it requires some complex work.

Matching engines

A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. Engines have different degrees of stability and offer different features.

PHP preg extension

It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is no hinting there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

Deterministic finite state automata (DFA)

This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of the PHP preg functions are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports hinting.

Currently supported operands:

single characters
escaped special characters
character classes, including ranges and negative classes
escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
octal and hexadecimal character codes preceded by \o and \x
meta-character . (any character)
unicode properties

Currently supported operators:

concatenation
alternative |
quantifiers * + ? {2,3} {2,} {,2} {2}
positive lookahead assertions
changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:

subpattern capturing
backreferences

Non-deterministic finite state automata (NFA)

NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:

subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions.

The ways to give back

I am a high school teacher, researcher and programmer who has much to do in his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it though. Some examples:

publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would help even more - please inform me immediately if you consider this;
if publishing is hard, you could just write to me about what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

Development plans

There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.

Improve simple assertions support
Support for complex assertions
Hinting not one character, but completion of the whole word
Add automatic generation of shortest possible correct answer in user-readable form
Add a set of authoring tools to make writing regular expressions easier
Develop the backtracking matching engine
Develop more help and examples for the people that don't know much about regular expressions.

Documentation