MoodleDocs - User contributions [ca]

File:CorrectWritingStudentAnswer.PNG

2013-03-02T22:41:38Z

Oasychev:

File:CorrectWritingEditForm.PNG

2013-03-02T22:41:22Z

Oasychev:

File:CorrectWritingAnswerStep2.PNG

2013-03-02T22:41:05Z

Oasychev:

File:CorrectWritingAnswerStep1.PNG

2013-03-02T22:40:46Z

Oasychev:

question/type/correctwriting

2013-03-02T22:39:37Z

Oasychev: Created page with "{{Questions}}===Goal=== This question type aims to automatically report of student mistakes in a shortanswer question, when student is learning programming language or in simple..."

{{Questions}}
===Goal===

This question type automatically reports student mistakes in a short-answer question. It is intended for students learning a programming language, or for simple cases of natural language learning, where the student must write a correct string (a statement or a sentence) and should be told which words were skipped or placed in the wrong order. It can also be useful not to disclose which words (or symbols) were moved or are absent from the answer, but only their meaning, which forces the student to understand what each part of the answer means. This allows some training without direct teacher supervision.

===How input is split===
The teacher should enter one or several correct answers (optionally with feedback describing each particular answer). These answers are ''tokenized'' (this process is called ''scanning'' below) to break them down into the smallest meaningful parts (''tokens'' or ''lexemes''): words, numbers, punctuation marks, operators, etc. These parts depend on the language used.

The question later analyzes the placement of these tokens and reports errors. But when we tell the student that the ''function name'' (or the ''subject'') in the answer is misplaced or absent, we often don't want to disclose the exact word for that function name or subject. A student who sees the misplaced word could simply start moving it around. A student who sees the grammatical role of the word (e.g. "function name" or "subject" is misplaced) is encouraged to think in grammatical categories (i.e. what is the subject in my response and where should it be?). So the teacher is asked to enter a grammatical description for each token in the correct answer. This is done as a two-step process.

In the answer form, you can enter an answer and supply specific feedback for it:

[[File:CorrectWritingAnswerStep1.PNG]]

Then press "Save changes". The answer form changes to let you enter a description for each token. You can skip the description for some tokens by just pressing Enter to leave an empty line in the text field; the student will then see the literal text of the token instead:

[[File:CorrectWritingAnswerStep2.PNG]]

Then press the "Save changes" button again.

===Question type settings===

When creating a new CorrectWriting question you will see a form with the following settings:

[[File:CorrectWritingEditForm.PNG]]

The single most important setting there is the '''answers language'''. It defines how your answer will be broken down into lexemes (tokens). The other settings are much less important and you need not worry about them at first. Most of them are advanced and are used to fine-tune question grading.

You can see the following parameters:

; Lexical error threshold : a threshold used when matching tokens from the student's response against tokens of the teacher's answer by Levenshtein distance. If the number of errors between two tokens is less than the product of this threshold and the length of the teacher's token, the tokens are considered the same token with errors. Since the search for misspellings is not implemented, this parameter is '''unused and hidden in the form'''.
; Penalty for lexical error : a penalty subtracted from the grade for each misspelling found. Since the search for misspellings is not implemented, this parameter is '''unused and hidden in the form'''.
; Penalty for missing token : a penalty subtracted from the grade for each token absent from the student's response.
; Penalty for extra token : a penalty subtracted from the grade for each extra token in the student's response.
; Penalty for misplaced token : a penalty subtracted from the grade for each token in the student's response that is placed in a different order than in the teacher's answer.
; Minimum grade for answer to find and display mistakes : sometimes the teacher wants to comment on particularly bad answers, and such comments should be shown only when the student's response is the same as the teacher's answer. The teacher can define an answer with a grade lower than this border and attach a custom error message to it.
; Maximum percent of mistakes in student's response : when a student's response contains too many mistakes relative to one teacher's answer, that answer can be rejected in favor of matching against another teacher's answer. This happens when the number of mistakes is greater than the product of this parameter and the number of tokens in the answer.
; Language of answer : the language used when analyzing the answer and the student's response. Currently supported languages are ''English'' and the ''C programming language''.
; Hinting settings : allow you to enable particular hints for multi-stage behaviours and to set penalties for their use. See the following sections for details about hints.
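
As a rough illustration of how the penalty settings above could combine, here is a minimal Python sketch. It is not the plugin's code: the function names <code>grade_response</code> and <code>too_many_mistakes</code>, the parameter names, and the assumption that grades and penalties are fractions between 0 and 1 with the result clamped at zero are all hypothetical.

```python
def grade_response(answer_grade, missing, extra, misplaced,
                   missing_penalty, extra_penalty, misplaced_penalty):
    """Subtract a per-mistake penalty from the grade of the matched answer.

    Hypothetical model: all grades and penalties are fractions in [0, 1],
    and the result is clamped so it never goes below zero.
    """
    total_penalty = (missing * missing_penalty
                     + extra * extra_penalty
                     + misplaced * misplaced_penalty)
    return max(0.0, answer_grade - total_penalty)


def too_many_mistakes(mistakes, token_count, max_mistake_fraction):
    """Mirror the 'maximum percent of mistakes' rule: the match is rejected
    when the mistake count exceeds the product of the setting and the
    number of tokens in the teacher's answer."""
    return mistakes > max_mistake_fraction * token_count


# Two missing tokens and one misplaced token, each penalized by 0.1.
print(grade_response(1.0, 2, 0, 1, 0.1, 0.1, 0.1))
# Three mistakes against a nine-token answer with a 50% limit.
print(too_many_mistakes(3, 9, 0.5))
```

Under this model a teacher can see in advance how many mistakes of each kind a response can contain before its grade reaches zero.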

===Hinting===
The CorrectWriting question type supports hinting behaviours and can give special hints (for now, only in the adaptive behaviour) in exchange for a penalty to the grade (which may be set to 0). You can enable a hint by setting its penalty below 1.

; What is hint : tells the student the token text instead of its description, e.g. '''The subject is "cat".''' Used for misplaced token and absent token mistakes. For an absent token mistake the penalty is multiplied by the ''absent hint penalty factor'', since the hint discloses the exact text of the word the student omitted.
; Where is text hint : shows a message saying where a token should be placed, based on your answer, using descriptions where possible, e.g. '''The subject should be placed between definite article and verb.''' Used for misplaced token and absent token mistakes.
; Where is picture hint : under development.

===Example of grading a student response===

Consider the following example for the question partially described above:

[[File:CorrectWritingStudentAnswer.PNG]]

It shows examples of student mistakes. You can toggle various quiz options to hide the picture of mistakes or the sentences describing them, as needed.

===Installing CorrectWriting question type===

To install this question type, you need the [http://code.google.com/r/oasychev-formallangs-block/ formal languages block] and the [http://code.google.com/r/oasychev-correctwriting/ question type] itself. You also need the [http://code.google.com/p/oasychev-moodle-plugins poas question type].

Just copy all of them into your Moodle installation folder, overwriting files if needed, then go to Site administration in the web interface and proceed with the installation.

==Examples==

Please, consider the following examples.

'''Function header in C language'''

The student must write a simple function header for a function described in natural language. The correct answer is ''void function(int abc, char def)'', with the following descriptions of its parts:

; void : type of returned value
; function : function name
; ( : bracket (or opening bracket for function arguments)
; int : first argument type
; abc : first argument name
; , : argument list separator
; char : second argument type
; def : second argument name
; ) : bracket (or closing bracket for function arguments)

If the student submits an answer like ''function(abc, def) void'', the question type will produce the following output:

; type of returned value is misplaced
; first argument type is missing
; second argument type is missing

'''English language sentence''':

The student must write a simple sentence while learning a foreign language. The correct answer is ''The cat ate the mouse.'', with the following descriptions of its parts:
; The : definite article
; cat : subject
; ate : verb
; the : definite article
; mouse : complement
; . : sentence ending point

If the student submits an answer like ''The cat eat the mouse'', the question type will produce the following output:

; "eat" should not be in response
; verb is missing
; sentence ending point is missing
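
The two examples above can be reproduced with a small Python sketch. It is only an illustration under stated assumptions, not the plugin's algorithm: tokens are given as ready-made lists (no scanner), the comparison is based on a longest common subsequence, and the names <code>lcs_pairs</code> and <code>classify</code> are hypothetical. Tokens outside the common subsequence that occur on both sides are reported as misplaced, those only in the teacher's answer as missing, and those only in the response as extra.

```python
from collections import Counter


def lcs_pairs(a, b):
    """Return index pairs (i, j) of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] is the LCS length of a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def classify(answer, response):
    """Split the differences into (misplaced, missing, extra) token lists."""
    pairs = lcs_pairs(answer, response)
    kept_answer = {i for i, _ in pairs}
    kept_response = {j for _, j in pairs}
    left_answer = [t for i, t in enumerate(answer) if i not in kept_answer]
    left_response = Counter(
        t for j, t in enumerate(response) if j not in kept_response
    )
    misplaced, missing = [], []
    for token in left_answer:
        if left_response[token] > 0:
            left_response[token] -= 1   # present, but in the wrong place
            misplaced.append(token)
        else:
            missing.append(token)
    extra = list(left_response.elements())
    return misplaced, missing, extra


c_answer = ["void", "function", "(", "int", "abc", ",", "char", "def", ")"]
c_response = ["function", "(", "abc", ",", "def", ")", "void"]
print(classify(c_answer, c_response))

en_answer = ["The", "cat", "ate", "the", "mouse", "."]
en_response = ["The", "cat", "eat", "the", "mouse"]
print(classify(en_answer, en_response))
```

Replacing each token in the result with its teacher-supplied description gives messages of the kind listed above.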

Preg question type

2013-01-06T21:46:59Z

Oasychev:

{{Questions}}The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give great power and flexibility both to teachers when writing questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and finally a part about the Preg question type itself. If you are familiar with regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. The '*' has precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but the minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of '''operators''', each with the correct number of '''operands''' (arity), controlling their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns, so their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' (preceded by a backslash) when used as operands. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains the ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't come first;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters; they can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; unlike other operands, they ''don't consume'' characters while matching (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
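
The operands above can be tried out in any Perl-style regex engine. The sketch below uses Python's standard <code>re</code> module, which is an assumption made purely for illustration (Preg itself runs in PHP). Python's <code>re</code> treats the escapes, character classes and simple assertions shown here the same way, but it does not support "\Q ... \E", "\p{xx}" or POSIX classes, so those are left out. The helper name <code>matches</code> is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Escaped special characters match themselves.
escaped_ok = matches(r"a\*b\[", "a*b[")
backslash_ok = matches(r"\\", "\\")

# A character class with the ranges a-s, C-F, 0-9 and the single character z.
in_class = [ch for ch in "atzCG5" if matches(r"[a-szC-F0-9]", ch)]

# A negative class: anything except a lowercase Latin letter or a hyphen.
negated = [ch for ch in "a-A1" if matches(r"[^a-z-]", ch)]

# \b is an assertion: it tests for a word boundary without consuming text.
whole_word = re.search(r"\bcat\b", "the cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None

print(escaped_ok, backslash_ok, in_class, negated, whole_word, inside_word)
```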

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to define it. It is still an operator and has its own precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of whatever is used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
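
The precedence rules above can be checked mechanically. The following sketch again assumes Python's standard <code>re</code> module as a stand-in engine (not Preg itself) and requires each pattern to match the whole string; the helper name <code>full</code> is hypothetical.

```python
import re


def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


checks = {
    # Quantifier over concatenation: only "b" is repeated in "ab*".
    ("ab*", "abbb"): full(r"ab*", "abbb"),
    ("ab*", "abab"): full(r"ab*", "abab"),
    ("(ab)*", "abab"): full(r"(ab)*", "abab"),
    # Quantifier over alternative: "*" applies to "d" only in "ab|cd*".
    ("ab|cd*", "cddd"): full(r"ab|cd*", "cddd"),
    ("ab|cd*", "abab"): full(r"ab|cd*", "abab"),
    ("(ab|cd)*", "ababcdabcdcd"): full(r"(ab|cd)*", "ababcdabcdcd"),
    # A quantified alternative may pick a different branch on each repeat.
    ("(a|b){2}", "ab"): full(r"(a|b){2}", "ab"),
    ("a{2}|b{2}", "ab"): full(r"a{2}|b{2}", "ab"),
}

for (pattern, text), result in checks.items():
    print(f"{pattern!r} fully matches {text!r}: {result}")
```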

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference must match the same character as the subpattern did.
Consider a small example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
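
To see the backreference at work, the second regex can be run as-is. The sketch below assumes Python's standard <code>re</code> module as a stand-in engine; the names <code>DECLARATION</code> and <code>declared_name</code> are hypothetical.

```python
import re

# The regex from the text: the backreference \1 must repeat the captured name.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)


def declared_name(source):
    """Return the variable name if source declares and then initializes the
    same integer variable, otherwise None."""
    match = DECLARATION.fullmatch(source)
    return match.group(1) if match else None


print(declared_name(" int var2 ; var2=123 ; "))
print(declared_name("int _var; _var = -10;"))
print(declared_name("int a; b = 1;"))  # different names: no match
```

The last call fails only because of the backreference: without "\1" the regex could not require that the initialized variable is the declared one.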

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

====Assertions====
Assertions test a condition on some part of the string. The text they examine is not included in the match, but they affect whether and where a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
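
The four assertions can be compared side by side. This sketch assumes Python's standard <code>re</code> module as a stand-in engine and an unanchored search; the helper name <code>found</code> is hypothetical. It also shows that a negative assertion can make the engine settle for a shorter or later match rather than fail.

```python
import re


def found(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


results = {
    # "b" is required after the run of "a" but is not part of the match.
    "positive lookahead": found(r"a+(?=b)", "caaab"),
    # The engine gives back one "a" so that the next character is not "b".
    "negative lookahead": found(r"a+(?!b)", "aaab"),
    # "b" is required before the run of "a" but is not part of the match.
    "positive lookbehind": found(r"(?<=b)a+", "baa"),
    # The first "a" follows "b", so the match starts at the second "a".
    "negative lookbehind": found(r"(?<!b)a+", "baa"),
}

for name, text in results.items():
    print(f"{name}: {text!r}")
```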

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will go through all regular expressions to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
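
The difference between the three anchored forms is easy to see on a few sample responses. The sketch assumes Python's standard <code>re</code> module as a stand-in engine with an unanchored search, which corresponds to exact matching being switched off; the helper name <code>accepted</code> and the sample list are hypothetical.

```python
import re

SAMPLES = ["ab", "cd", "abxyz", "xyzcd", "xabx"]


def accepted(pattern):
    """Return the samples in which the pattern finds a match."""
    return [text for text in SAMPLES if re.search(pattern, text)]


# Each anchor belongs to one alternative only.
loose = accepted(r"^ab|cd$")
# Parentheses make both anchors apply to the whole alternative set.
grouped = accepted(r"^(ab|cd)$")
# Anchoring every top-level alternative has the same effect.
per_branch = accepted(r"^ab$|^cd$")

print(loose)
print(grouped)
print(per_branch)
```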

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers are written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type simply as an improved shortanswer, with access to hinting and without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said here on the topic of regular expressions, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When only partial matching is used, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button. Pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty applies to an attempt without hinting, the hint penalty to an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For a natural language, a ''word'', a ''number'' or a ''punctuation mark'' (or a group of marks like '?!' or '...') is a lexem. For a programming language it is a ''keyword'', a ''variable name'', a ''constant'' or an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Since the 2.3 release, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different in different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax may become complex and prone to errors.
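The substitution can be tried out with a short Python sketch. It uses Python's <code>re</code> module rather than the Preg engines, and the helper name <code>fill_placeholders</code> is hypothetical; it only imitates the placeholder behaviour described above.

```python
import re


def fill_placeholders(feedback, match):
    """Replace {$n} by the text captured by subpattern n.

    A subpattern that took no part in the match gives an empty string.
    """
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # Unknown subpattern number: leave the placeholder untouched.
            return placeholder.group(0)
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_placeholders(FEEDBACK, NUMBER.fullmatch("123.34")))
# No integral part: {$1} becomes an empty string.
print(fill_placeholders(FEEDBACK, NUMBER.fullmatch(".5")))
```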

====Error reporting====
The native PHP preg extension functions only report whether or not there is an error in a regular expression, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes <nowiki>"[[:hamster:]]"</nowiki>;
* unknown (*...) sequence "(*QWERTY)";
* incorrect character set range "[z-a]";
* incorrect quantifier ranges "{5,3}";
* \ at end of pattern "ab\";
* \c at end of pattern "ab\c";
* invalid escape sequence;
* POSIX class outside of a character set "[:digit:]";
* reference to a nonexistent subpattern "(abc)\2";
* unknown, wrong or unsupported modifier "(?z)";
* missing ) after comment "(?#comment";
* missing conditional subpattern name ending;
* missing ) after (?C;
* missing subpattern name ending;
* missing backreference name ending;
* missing backreference name beginning;
* missing ) after control sequence;
* wrong conditional subpattern number, digits expected;
* assertion or condition expected "(?()a|b)";
* character code too big "\x{ffffffff}";
* character code disallowed "\x{d800}";
* invalid condition "(?(0)";
* too big number in (?C...) "(?C256)";
* two named subpatterns have the same name "(?<name>a)(?<name>b)";
* backreference to the whole expression "abc\g{0}";
* different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
* subpattern name expected "(?<>abc)";
* \c should be followed by an ascii character "\cй";
* \L, \l, \N{name}, \U, and \u are unsupported;
* unrecognized character after (?<.

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and this is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that will work when something is absent in the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
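A quick way to convince yourself that the pattern works is to try it outside Moodle. The sketch below uses Python's <code>re</code> module, which handles this particular pattern the same way; the function name <code>word_is_missing</code> is hypothetical.

```python
import re

# Matches (with an empty match at the start) only if the word is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    return MISSING_NECESSARY.search(response) is not None


print(word_is_missing("this step is optional"))   # True
print(word_is_missing("this step is necessary"))  # False
# "unnecessary" is another word: there is no word boundary before "necessary".
print(word_is_missing("an unnecessary step"))     # True
```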

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by checking that "am" doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; a lookbehind assertion can't contain an unlimited repetition such as "\s+", so it uses a single "\s". If you want to catch the first (last) word (punctuation mark, etc), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by the start of the string and not followed by "am".
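The idea can be tried in Python's <code>re</code> module. This sketch writes the "no I before it" condition as the fixed-width negative lookbehind <code>(?<!I\s)</code>, because Python's <code>re</code> requires lookbehind assertions to have a fixed width; the function name <code>am_is_misplaced</code> is hypothetical.

```python
import re

# "am" that is neither preceded by "I" plus one space nor followed by "victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")


def am_is_misplaced(response):
    return MISPLACED_AM.search(response) is not None


print(am_is_misplaced("I am victor"))  # False: "am" is in its place
print(am_is_misplaced("am I victor"))  # True
print(am_is_misplaced("I victor am"))  # True
```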

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so it's not really a restriction.

====Matching engines====
A matching engine is the program code that performs matching; different engines use different code. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all regex operands and operators are supported, and for some (more exotic) ones the support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who must do much on his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

question/type/preg

2013-01-06T21:45:52Z

Oasychev: Redirected page to Preg question type

#REDIRECT [[Preg question type]]

Preg question type

2012-08-06T20:32:21Z

Oasychev: /* Development plans */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test; they ''don't consume'' characters while matching, unlike other operands (they have this meaning only outside character classes):
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
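Several of these operands can be tried out with Python's <code>re</code> module, as in the sketch below. Python is used here only for illustration: its syntax is close to, but not the same as, the Perl-compatible dialect used by Preg (for example, it has no "\Q ... \E", so <code>re.escape</code> stands in for it, and it doesn't support "\p{xx}" or POSIX classes). The helper name <code>full</code> is hypothetical.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


print(full(r"a\*b\[", "a*b["))     # True: escaped special characters
print(full(r"[a-szC-F0-9]", "z"))  # True: the single character "z"
print(full(r"[a-szC-F0-9]", "t"))  # False: "t" is outside the range "a-s"
print(full(r"[^a-z-]", "-"))       # False: the hyphen is excluded too
print(full(r"[\-\]\\]", "]"))      # True: escaping inside a character set
print(full(r"\w\s\d", "a 5"))      # True: word character, space, digit

# "\b" needs a word boundary on both sides of "cat".
print(re.search(r"\bcat\b", "a cat sat") is not None)    # True
print(re.search(r"\bcat\b", "concatenate") is not None)  # False

# re.escape plays the role of "\Q ... \E": everything is taken literally.
print(full(re.escape("^(abc)$") + ".", "^(abc)$!"))      # True
```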

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
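The empty alternative and the range quantifier are the two items here that most often surprise people, so here is a small sketch of them using Python's <code>re</code> module (an illustration only, not the Preg engines; the helper name <code>full</code> is hypothetical).

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


# An empty alternative makes the whole group optional.
for text in ["aac", "bbc", "c", "abc"]:
    print(text, full(r"(aa|bb|)c", text))

# A range quantifier accepts only the listed numbers of repetitions.
for text in ["x", "xx", "xxxx", "xxxxx"]:
    print(text, full(r"x{2,4}", text))

# Parentheses let a quantifier repeat more than one character.
print(full(r"(ab)*", "ababab"))  # True
print(full(r"(ab)*", "aba"))     # False
```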

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
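The effect of precedence is easiest to see by comparing regexes that differ only in their parentheses. The sketch below does that with Python's <code>re</code> module, which follows the same precedence rules; the helper name <code>full</code> is hypothetical.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


# The quantifier binds to "d" only: "ab", or "c" followed by any number of "d".
print(full(r"ab|cd*", "cdddddd"))  # True
print(full(r"ab|cd*", "abab"))     # False

# Parentheses make the quantifier repeat the whole alternative.
print(full(r"(ab|cd)*", "ababcdabcdcd"))  # True

# Repeating an alternative is not the same as an alternative of repetitions.
print(full(r"(a|b){2}", "ab"))   # True
print(full(r"a{2}|b{2}", "ab"))  # False
print(full(r"a{2}|b{2}", "bb"))  # True
```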

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is the 1st, "c" which is the 2nd and "d" which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which were matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
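The examples from this section can be checked with Python's <code>re</code> module, which numbers subpatterns and handles "\1" in the same way. This is only a sketch for experimenting outside Moodle; the names <code>full</code> and <code>DECLARATION</code> are hypothetical.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None


# The backreference must repeat exactly what the subpattern matched.
for text in ["aa", "bb", "ab", "ba"]:
    print(text, full(r"([ab])\1", text))

DECLARATION = r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
print(full(DECLARATION, " int var2 ; var2=123 ; "))  # True
print(full(DECLARATION, "int a; b = 1;"))            # False: another variable

# Subpatterns are counted by their opening parentheses.
nested = re.fullmatch(r"a(b(c)(d))e", "abcde")
print(nested.groups())  # ('bcd', 'c', 'd')
```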

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
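Named subpatterns can be tried in Python too, with the caveat that Python's <code>re</code> module accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above. The sketch below rewrites the C declaration regex with a name; the subpattern name <code>var</code> is just an example.

```python
import re

# The same declaration regex, with the variable captured by name.
DECLARATION = re.compile(
    r"\s*int\s+(?P<var>[_\w][_\w\d]*)\s*;\s*(?P=var)\s*=\s*-?\d+\s*;\s*"
)

m = DECLARATION.fullmatch("int counter; counter = 7;")
print(m.group("var"))  # counter
print(m.group(1))      # counter: a named subpattern still has its number

print(DECLARATION.fullmatch("int counter; total = 7;"))  # None
```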

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b" without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" that are not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" that are not preceded by "b".
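What exactly gets into the match is easier to see on concrete strings. The sketch below runs the four regexes in Python's <code>re</code> module, which treats these assertions the same way; the helper name <code>first_match</code> is hypothetical.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


print(first_match(r"a+(?=b)", "aaab"))   # aaa: "b" is checked but not included
print(first_match(r"a+(?!b)", "aaab"))   # aa: the last "a" is followed by "b"
print(first_match(r"(?<=b)a+", "baaa"))  # aaa
print(first_match(r"(?<!b)a+", "baaa"))  # aa: the first "a" is preceded by "b"
```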

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and give the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
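The difference between the three regexes shows up as soon as the response contains something extra. The sketch below checks them with Python's <code>re.search</code>, which, like an unanchored Preg answer, looks for a match anywhere in the string; the helper name <code>found</code> is hypothetical.

```python
import re


def found(pattern, response):
    """True if the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None


# Each anchor belongs to one alternative only.
print(found(r"^ab|cd$", "abxyz"))  # True: starts with "ab"
print(found(r"^ab|cd$", "xyzcd"))  # True: ends with "cd"

# Both ways of anchoring all the alternatives reject the extra characters.
print(found(r"^(ab|cd)$", "abxyz"))  # False
print(found(r"^(ab|cd)$", "cd"))     # True
print(found(r"^ab$|^cd$", "abxyz"))  # False
print(found(r"^ab$|^cd$", "ab"))     # True
```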

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen on the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
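The same effect can be tried in Python, with one difference in syntax: recent versions of Python's <code>re</code> module don't accept "(?i)" in the middle of a pattern, so the sketch below uses the scoped form "(?i:...)" (available since Python 3.6) around the part that should ignore case. It illustrates the idea only and is not the syntax you enter in the question; the helper name <code>full</code> is hypothetical.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole text (case-sensitive by default)."""
    return re.fullmatch(pattern, text) is not None


# Only "gh" ignores case, as in "abc(de(?i)gh)xyz".
print(full(r"abc(de(?i:gh))xyz", "abcdeGHxyz"))  # True
print(full(r"abc(de(?i:gh))xyz", "abcDEghxyz"))  # False: "de" is case-sensitive

# Everything after "(de)" ignores case, as in "abc(de)(?i)ghxyz".
print(full(r"abc(de)(?i:ghxyz)", "abcdeGHXYZ"))  # True
print(full(r"abc(de)(?i:ghxyz)", "ABCdeghxyz"))  # False
```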

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. '''Regular expression''', which means the Perl-compatible regex dialect, is the default one. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, with access to hinting and without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything that is said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have the '''hint next character''' button; by pressing it he receives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether he wants to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not a desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
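
The choice between ',' and ' ' can be illustrated with a minimal Python sketch. It is not the engine's algorithm: it assumes the complete strings accepted by the regex can simply be listed, whereas a real matcher would work on its automaton. The function name <code>next_hint_character</code> is hypothetical.

```python
def next_hint_character(matched_prefix, full_answers):
    """Pick the next character that leads to the shortest completion."""
    # Keep only the complete answers that extend the partial match
    candidates = [answer for answer in full_answers
                  if answer.startswith(matched_prefix)
                  and len(answer) > len(matched_prefix)]
    if not candidates:
        return None
    # The shortest complete answer needs the fewest extra characters
    shortest = min(candidates, key=len)
    return shortest[len(matched_prefix)]


# Both strings accepted by: are blue, white(,| and) red
answers = ["are blue, white, red", "are blue, white and red"]
print(repr(next_hint_character("are blue, white", answers)))
```

Here the ", red" completion is 3 characters shorter than " and red", so the sketch returns ','.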

Not all regular expressions necessarily give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
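
As a rough Python illustration of this rule (the data layout and the function name <code>answers_used_for_hinting</code> are assumptions made for the example, not the question type's internals), the selection is a simple filter on the grade:

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Return the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


# (regex, grade) pairs taken from the examples above
answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

print(answers_used_for_hinting(answers, 1))    # only the 100% answer
print(answers_used_for_hinting(answers, 0.5))  # both answers
```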

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax could become complex and prone to errors.
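
The substitution described above can be imitated in a few lines of Python. This is only a sketch of the idea using Python's <code>re</code> module, not the question type's PHP code; the helper name <code>fill_feedback</code> is hypothetical.

```python
import re


def fill_feedback(pattern, response, feedback):
    """Replace {$N} placeholders in feedback with captured subpatterns."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        # Nonexistent or unmatched subpatterns become an empty string
        if index > match.re.groups:
            return ""
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


pattern = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(pattern, "123.34", feedback))
print(fill_feedback(pattern, ".5", feedback))
```

For the response ".5" the first subpattern does not take part in the match, so {$1} is replaced by an empty string while the surrounding words stay in place.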

====Error reporting====
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
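
For comparison only, here is how another strict parser reacts to some of these cases. The sketch uses Python's <code>re</code> module, which is a different engine from Preg's parser and from PCRE, so its messages and its set of detected errors are not those of the question type; the helper name <code>describe_error</code> is hypothetical.

```python
import re

broken_patterns = [
    "abc)",        # unopened closing parenthesis
    "(?:qwerty",   # unclosed opening parenthesis
    "+",           # quantifier without an operand
    "a(+)",        # quantifier at the start of a subexpression
    "[a-fA-F\\d",  # unclosed brackets of a character class
    "a{5,4}",      # incorrect range for a quantifier
    "[z-a]",       # incorrect range in a character set
]


def describe_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None


for pattern in broken_patterns:
    print(pattern, "->", describe_error(pattern))
```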

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (which matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
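
A quick way to convince yourself that the anchored negative lookahead behaves as described is to try it outside Moodle. The sketch below uses Python's <code>re</code> module, which treats this particular expression the same way; the function name <code>is_word_missing</code> is hypothetical.

```python
import re

# Matches (with an empty match at the start) only when the word is absent
missing_necessary = re.compile(r"^(?!.*\bnecessary\b.*)")


def is_word_missing(response):
    """True when the response does not contain the word 'necessary'."""
    return missing_necessary.search(response) is not None


print(is_word_missing("This step is optional"))     # word absent
print(is_word_missing("This step is necessary"))    # word present
print(is_word_missing("This step is unnecessary"))  # a different word
```

The last response counts as missing the word, because "\b" requires a word boundary before "necessary" and there is none inside "unnecessary".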

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind checks a single space only, because PCRE lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
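
The behaviour of such expressions is easy to check by hand. The Python sketch below writes the lookbehinds in the standard form "(?<!...)" with a single "\s" (an assumption made for the example, since Python's <code>re</code> module also requires fixed-length lookbehind); the function name <code>is_misplaced</code> is hypothetical.

```python
import re

# "am" with neither "I " before it nor " victor" after it
misplaced_am = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by "am"
misplaced_i = re.compile(r"(?<!^)\bI\b(?!\s+am)")


def is_misplaced(regex, response):
    """True when the regex finds a misplaced word in the response."""
    return regex.search(response) is not None


for response in ["I am victor", "am I victor", "I victor am"]:
    print(response, "->",
          is_misplaced(misplaced_am, response),
          is_misplaced(misplaced_i, response))
```

Note how rough the check is: in "I victor am" the word "I" stands at the start, so only the "am" expression reports a problem.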

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

For now, the NFA and DFA matchers don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so this is no real restriction.

====Matching engines====
A matching engine is the program code that performs matching; different engines implement it in different ways. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add support for it) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do in his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Support for regular expression recursion
* Support for approximate matching to catch typos in answers
* Add a set of authoring tools to make writing regular expressions easier
* Add more languages for next lexem hinting
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T20:06:41Z

Oasychev: /* Error reporting */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order as set by their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you can use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (can be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (have those meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
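
The character class examples above can be tried out directly. The following sketch uses Python's <code>re</code> module, which handles these particular classes the same way; the helper name <code>in_class</code> is hypothetical.

```python
import re


def in_class(char_class, character):
    """True when the single character belongs to the character class."""
    return re.fullmatch(char_class, character) is not None


# Ranges mixed with a single character
print(in_class(r"[a-szC-F0-9]", "z"), in_class(r"[a-szC-F0-9]", "t"))
# Negative character set: the trailing hyphen stands for itself
print(in_class(r"[^a-z-]", "A"), in_class(r"[^a-z-]", "-"))
# Escaping inside a character set
print(in_class(r"[\-\]\\]", "]"), in_class(r"[\-\]\\]", "a"))
```

Each line prints <code>True False</code>: the first character of each pair belongs to the class and the second does not.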

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
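
These precedence rules can be checked quickly with any regex engine. The sketch below uses Python's <code>re.fullmatch</code>, so each test requires the whole string to match; the helper name <code>matches_fully</code> is hypothetical.

```python
import re


def matches_fully(regex, text):
    """True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None


# Quantifier over concatenation
print(matches_fully(r"ab*", "abbb"), matches_fully(r"ab*", "abab"))
print(matches_fully(r"(ab)*", "abab"))
# Quantifier over alternative
print(matches_fully(r"ab|cd*", "cddd"), matches_fully(r"ab|cd*", "abab"))
print(matches_fully(r"(ab|cd)*", "ababcdabcdcd"))
# A repeated alternative versus an alternative of repetitions
print(matches_fully(r"(a|b){2}", "ab"), matches_fully(r"a{2}|b{2}", "ab"))
```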

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd", which is the 1st, "c", which is the 2nd, and "d", which is the 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc., so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but such a regex is easier to write once than to understand afterwards.
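
Numbered backreferences work the same way in Python's <code>re</code> module, so the two regexes above can be tried as they are. This is just an illustration outside Moodle; the variable names are hypothetical.

```python
import re

doubled = re.compile(r"([ab])\1")
declaration = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

# The backreference must repeat the character captured by the subpattern
print(doubled.fullmatch("aa") is not None)
print(doubled.fullmatch("ab") is not None)

# The same variable name must be declared and initialized
good = declaration.fullmatch(" int var2 ; var2=123 ; ")
bad = declaration.fullmatch("int a; b = 1;")
print(good.group(1), bad)
```

The second declaration is rejected because "\1" has to match "a", the name captured by the first subpattern, and finds "b" instead.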

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
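
As an illustration, here is a named version of the variable declaration example, written for Python's <code>re</code> module. Of the forms listed above, Python accepts only "(?P<name>...)" and "(?P=name)", so the sketch is limited to those; the group name <code>var</code> and the identifier pattern are assumptions made for the example.

```python
import re

named = re.compile(
    r"int\s+(?P<var>[_a-zA-Z]\w*)\s*;\s*(?P=var)\s*=\s*-?\d+\s*;")

match = named.fullmatch("int count; count = -10;")
print(match.group("var"))

# A different name in the assignment breaks the backreference
print(named.fullmatch("int count; total = -10;"))
```

Adding another subpattern in front of this one would change its number, while the name <code>var</code> would keep working.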

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
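
The effect of each assertion is easiest to see on the text that actually ends up in the match. The sketch below uses Python's <code>re</code> module, which handles these four assertions like PCRE; the helper name <code>first_match</code> is hypothetical.

```python
import re


def first_match(regex, text):
    """Return the text of the first match, or None if there is none."""
    found = re.search(regex, text)
    return found.group(0) if found else None


print(first_match(r"a+(?=b)", "aaab"))   # "aaa": "b" is checked, not consumed
print(first_match(r"a+(?!b)", "aaab"))   # "aa": the last "a" is followed by "b"
print(first_match(r"(?<=b)a+", "baaa"))  # "aaa": preceded by "b"
print(first_match(r"(?<!b)a+", "baaa"))  # "aa": the first "a" is preceded by "b"
```

The negative assertions do not make the whole match fail here; the engine simply finds a shorter or later match for which the condition holds.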

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is added always and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one is the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
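
The difference between these three ways of anchoring shows up on responses that contain extra text. This sketch uses Python's <code>re.search</code> to imitate an unanchored search through the response; the helper name <code>found</code> is hypothetical.

```python
import re


def found(regex, response):
    """True when the regex matches somewhere in the response."""
    return re.search(regex, response) is not None


# Each alternative carries only one of the two anchors
print(found(r"^ab|cd$", "abxyz"), found(r"^ab|cd$", "xyzcd"))
# Parentheses make both anchors apply to every alternative
print(found(r"^(ab|cd)$", "abxyz"), found(r"^(ab|cd)$", "cd"))
# Anchoring every top-level alternative gives the same exact match
print(found(r"^ab$|^cd$", "ab"), found(r"^ab$|^cd$", "abcd"))
```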

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This affects the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match, but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you can get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not begin at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student gets a '''hint next character''' button; pressing it shows a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty applies to an attempt without hinting, while the hint penalty applies to an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. In a natural language, a ''word'', a ''number'' or a ''punctuation mark'' (or a group of marks like '?!' or '...') is a lexem. In a programming language, it is a ''keyword'', a ''variable name'', a ''constant'' or an ''operator''. Note that spaces are usually considered not lexems but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders differ between languages. For now only two languages are supported (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly facilitates thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
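
The choice described above can be sketched in Python. This is only an illustration, not the Preg engine's algorithm: the helper name <code>next_hint_char</code> is hypothetical, and listing every string the expression can match (<code>expansions</code>) is a simplifying assumption that works only for tiny finite expressions.

```python
def next_hint_char(expansions, response):
    """Pick the next character on the shortest way to a complete match.

    `expansions` lists every string the expression can match (an assumption
    made for this sketch; a real engine would walk an automaton instead).
    """
    best = None  # ((negated matched length, remaining length), remaining text)
    for full in expansions:
        # Length of the common prefix of the response and this expansion.
        matched = 0
        while (matched < len(full) and matched < len(response)
               and full[matched] == response[matched]):
            matched += 1
        remaining = full[matched:]
        # Prefer the longest partial match, then the shortest completion.
        key = (-matched, len(remaining))
        if best is None or key < best[0]:
            best = (key, remaining)
    remaining = best[1]
    return remaining[0] if remaining else ""


# The two strings matched by: are blue, white(,| and) red
expansions = ["are blue, white, red", "are blue, white and red"]
hint = next_hint_char(expansions, "are blue, white; red")
print(repr(hint))
```

Both expansions share the same partial match here, so the tie is broken by the length of the remaining text: ", red" is 3 characters shorter than " and red".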

It is possible that not all regular expressions give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used and those with grades from 0% to 49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
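
As a rough sketch of the rule (not the question type's actual code), the selection can be written as a simple filter. The function name <code>answers_used_for_hinting</code> and the representation of answers as (regex, grade) pairs with grades between 0 and 1 are assumptions made for this example.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Return the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


answers = [
    (r"are blue, white(,| and) red", 1.0),  # fully correct answer
    (r"are white(,| and) red", 0.6),        # the student forgot "blue"
]

# Border 1: only the 100% answer may produce hints.
only_full = answers_used_for_hinting(answers, 1.0)
# Border 0.5: the 60% answer takes part in hinting as well.
both = answers_used_for_hinting(answers, 0.5)
print(only_full)
print(both)
```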

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all the repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses: '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} is replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax may become complex and prone to errors.
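
The substitution can be imitated with a few lines of Python. This is a sketch under assumptions, not the question type's code: the helper <code>fill_feedback</code> is hypothetical and Python's <code>re</code> module stands in for the matching engine.

```python
import re


def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders in feedback with captured subpatterns."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # No such subpattern: leave the placeholder untouched.
            return placeholder.group(0)
        # A subpattern that took no part in the match gives an empty string.
        return match.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```

The second call shows the limitation mentioned above: the words "Integral part is" stay in the feedback even though {$1} is empty.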

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
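
To get a feeling for what descriptive error reporting looks like, here is a small Python sketch. It uses the parser of Python's <code>re</code> module, which is an assumption made purely for illustration: it is neither the Preg parser nor PCRE, and the set of errors and the wording of messages differ between implementations. The helper name <code>describe_error</code> is hypothetical.

```python
import re


def describe_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None


# Some of the error classes listed above.
samples = ["abc)", "(?:qwerty", "a(+)", "a{5,4}", "[z-a]"]
for pattern in samples:
    print(repr(pattern), "->", describe_error(pattern))
```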

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (which matches the string "{2,4}"). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
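
A quick way to try the expression is Python's <code>re</code> module, used here as a stand-in for the PHP preg engine (an assumption; the helper name <code>word_is_missing</code> is hypothetical).

```python
import re

# Matches (with an empty match at the start) only when "necessary" is absent.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True when the whole word "necessary" does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None


print(word_is_missing("this step is optional"))        # the word is absent
print(word_is_missing("this step is necessary here"))  # the word is present
print(word_is_missing("this step is unnecessary"))     # only part of another word
```

The last call shows why the "\b" assertions matter: "unnecessary" does not count as the word "necessary".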

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between the words; the lookbehind uses a single "\s" because PCRE requires each lookbehind alternative to match a fixed length. If you want to catch the first (last) word (punctuation mark, etc.), place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
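
The idea can be tried in Python, whose <code>re</code> module stands in for the PHP preg engine here (an assumption). Note that the sketch writes the lookbehind as "(?<!I\s)", with a single "\s": Python, like PCRE at the time of writing, accepts only fixed-length lookbehind. The helper name <code>am_is_misplaced</code> is hypothetical.

```python
import re

# "am" that is neither preceded by "I " nor followed by spaces and "victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")


def am_is_misplaced(response):
    """True when an "am" is found outside its place in "I am victor"."""
    return MISPLACED_AM.search(response) is not None


for response in ["I am victor", "am I victor", "I victor am"]:
    print(response, "->", am_is_misplaced(response))
```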

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers don't yet support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words, but you should make sure that the answers that look for missing things (and other answers that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so this is no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions it should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that can't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has much to do in his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I might be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T19:46:01Z

Oasychev: /* Next lexem hinting */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
The regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and don't need to be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (could be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (have those meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
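
Several of these operands can be tried out in Python. The <code>re</code> module is used here only as a convenient stand-in (an assumption): it agrees with PCRE for the operands shown, but it does not support "\Q...\E", "\p{xx}" or POSIX classes, so those are left out. The table name <code>checks</code> is made up for this sketch.

```python
import re

# (regex, text, expected): does the regex match the whole text?
checks = [
    (r"a\*b\[", "a*b[", True),       # escaped special characters
    (r"\\", "\\", True),             # an escaped backslash matches one backslash
    (r"a.c", "abc", True),           # dot matches any character
    (r"a\.c", "abc", False),         # an escaped dot matches only a dot
    (r"[a-szC-F0-9]", "z", True),    # ranges mixed with a single character
    (r"[a-szC-F0-9]", "t", False),   # "t" is outside the range "a-s"
    (r"[^a-z-]", "-", False),        # negative set: the hyphen is excluded too
    (r"[\-\]\\]", "]", True),        # escaping inside a character set
    (r"\w\s\d", "a 1", True),        # word character, space, digit
    (r"\bcat\b", "cat", True),       # word boundaries consume no characters
]

results = [
    (re.fullmatch(regex, text) is not None) == expected
    for regex, text, expected in checks
]
print(all(results))
```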

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''', which allow us to define an expression that matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
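
A few of the examples above, checked with Python's <code>re</code> module (a stand-in for the Preg engines, used here as an assumption; the helper <code>matches</code> is hypothetical and tests whether the regex matches the whole string):

```python
import re


def matches(regex, text):
    """True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None


print(matches(r"a[0-9]", "a5"))        # concatenation
print(matches(r"ab|cd|de", "cd"))      # alternative
print(matches(r"(aa|bb|)c", "c"))      # empty alternative
print(matches(r"x{2,4}", "xxxxx"))     # five "x" is too many for {2,4}
print(matches(r"x{2,}", "xxxxx"))      # two or more
print(matches(r"(ab)*", "ababab"))     # quantifier applied to a group
print(matches(r"(a|b){2}", "ab"))      # repeated alternative
```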

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
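
The precedence rules are easy to verify by experiment. The sketch below uses Python's <code>re</code> module as a stand-in (an assumption); the helper <code>matches</code> is hypothetical and checks whether the regex matches the whole string.

```python
import re


def matches(regex, text):
    """True when the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None


# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
print(matches(r"ab*", "abbb"), matches(r"ab*", "abab"), matches(r"(ab)*", "abab"))

# Quantifier over alternative: "*" belongs to "d" only.
print(matches(r"ab|cd*", "cddd"), matches(r"ab|cd*", "abd"))
print(matches(r"(ab|cd)*", "ababcdabcdcd"))

# Repeating the whole alternative versus repeating each branch.
print(matches(r"(a|b){2}", "ab"), matches(r"a{2}|b{2}", "ab"))
```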

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered starting from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd", which is the 1st, "c", which is the 2nd, and "d", which is the 3rd.
Subpatterns are usually used with '''backreferences''', which have numbers too. Backreferences are '''operands''' that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference must match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
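
Both regexes from this section can be tried in Python, with the <code>re</code> module standing in for the Preg engines (an assumption; the names <code>DECLARATION</code> and <code>declared_variable</code> are made up for this sketch).

```python
import re

# The space-tolerant regex from the text above.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)


def declared_variable(code):
    """Return the variable name if it is declared and then initialized."""
    match = DECLARATION.fullmatch(code)
    return match.group(1) if match else None


print(declared_variable(" int var2 ; var2=123 ; "))
print(declared_variable("int _var; _var = -10;"))
# A different name in the assignment breaks the backreference.
print(declared_variable("int a; b = 1;"))

# The backreference must repeat exactly what the subpattern captured.
print(re.fullmatch(r"([ab])\1", "aa") is not None)
print(re.fullmatch(r"([ab])\1", "ab") is not None)
```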

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
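
A short Python sketch shows why names survive editing. Python's <code>re</code> module is used as a stand-in (an assumption); of the syntaxes listed above it accepts only "(?P<name>...)" and "(?P=name)", so the example is limited to those. The names <code>DOUBLED</code> and <code>PREFIXED</code> are made up for this sketch.

```python
import re

# A named subpattern and a named backreference to it.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

# The same regex after another subpattern was added in front: the number
# of "letter" changes from 1 to 2, but the name keeps working.
PREFIXED = re.compile(r"(x)?(?P<letter>[ab])(?P=letter)")

first = DOUBLED.fullmatch("bb")
second = PREFIXED.fullmatch("xaa")
print(first.group("letter"), second.group("letter"))
print(DOUBLED.fullmatch("ab") is None)
```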

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both the 1st, "b" and "d" are both the 2nd.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
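
The four assertions can be compared side by side in Python, whose <code>re</code> module is used here as a stand-in for the Preg engines (an assumption; the helper name <code>found</code> is hypothetical).

```python
import re


def found(regex, text):
    """Return the matched text, or None when there is no match."""
    match = re.search(regex, text)
    return match.group(0) if match else None


# Lookahead: "b" is checked but never becomes part of the match.
print(found(r"a+(?=b)", "aaab"))
# The last "a" is followed by "b", so the match stops one character earlier.
print(found(r"a+(?!b)", "aaab"))
# Lookbehind: "b" before the match is checked but not included.
print(found(r"(?<=b)a+", "baaa"))
# The first "a" is preceded by "b", so the match starts one character later.
print(found(r"(?<!b)a+", "baaa"))
```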

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers are added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question tries all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and gives the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.
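
The grading part of this process can be sketched as follows. It is a simplified illustration under assumptions, not the question type's code: the function name <code>grade_response</code> and the (regex, grade) pairs are hypothetical, answers are tried in the order they are listed, Python's <code>re</code> module stands in for the matching engine, and partial matching is left out.

```python
import re


def grade_response(answers, response, case_sensitive=True):
    """Return the grade of the first answer whose regex fully matches."""
    # Python 3 string patterns are Unicode-aware, similar to the "u" modifier;
    # IGNORECASE plays the role of the "i" modifier.
    flags = 0 if case_sensitive else re.IGNORECASE
    for regex, grade in answers:
        if re.search(regex, response, flags):
            return grade
    return 0.0


answers = [
    (r"^are blue, white(,| and) red$", 1.0),
    (r"^are white(,| and) red$", 0.6),
]

print(grade_response(answers, "are blue, white and red"))
print(grade_response(answers, "are white, red"))
print(grade_response(answers, "ARE WHITE, RED", case_sensitive=False))
print(grade_response(answers, "no idea"))
```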

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" using brackets to match exactly with "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
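
The difference between the three forms shows up as soon as the response contains extra text. The sketch below uses Python's <code>re</code> module as a stand-in for the Preg engines (an assumption; the helper name <code>found</code> is hypothetical).

```python
import re


def found(regex, text):
    """Return the matched text, or None when there is no match."""
    match = re.search(regex, text)
    return match.group(0) if match else None


# "^" belongs only to "ab" and "$" only to "cd".
print(found(r"^ab|cd$", "abxyz"))
print(found(r"^ab|cd$", "xyzcd"))
# With brackets both anchors apply to every alternative.
print(found(r"^(ab|cd)$", "abxyz"))
print(found(r"^(ab|cd)$", "cd"))
# Anchoring every top-level alternative gives the same exact match.
print(found(r"^ab$|^cd$", "xyzcd"))
```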

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This modifies the general case-sensitivity setting, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match, but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers are written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type as an improved shortanswer, with access to hinting and without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and copy the answers from your shortanswer questions. The '*' wildcard is supported. Choosing the NFA or DFA engine gives you access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the settings that configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match does not have to start at the start of the student's response. When only partial matching is used, the student is shown the correct and incorrect parts:
they are blue, vhite and red
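
The idea of partial matching can be sketched in a few lines of Python. This is only an illustration, not the Preg engine: it assumes the expression '''are blue, white(,| and) red''' has been expanded by hand into its two full strings (the hypothetical <code>CANDIDATES</code> list), and it simply looks for the longest prefix of any candidate that occurs somewhere in the response.

```python
def longest_partial_match(candidates, response):
    """Return (start, text): the longest candidate prefix found anywhere in response."""
    best_start, best_text = 0, ""
    for start in range(len(response)):
        for candidate in candidates:
            length = 0
            # Extend the match while candidate and response agree character by character.
            while (length < len(candidate)
                   and start + length < len(response)
                   and candidate[length] == response[start + length]):
                length += 1
            if length > len(best_text):
                best_start, best_text = start, candidate[:length]
    return best_start, best_text


# Hypothetical hand expansion of 'are blue, white(,| and) red'.
CANDIDATES = ["are blue, white, red", "are blue, white and red"]

start, matched = longest_partial_match(CANDIDATES, "they are blue, vhite and red")
next_character = CANDIDATES[0][len(matched)]  # the character a hint would reveal
```

Here the match begins after "they " because the expression is unanchored, and it breaks at the mistyped "v".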

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button. Pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white
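
A rough sketch of how such a hint could be computed, assuming a very simple lexer (words or numbers, and runs of punctuation). The <code>LEXEM</code> pattern and the function name are illustrative assumptions, not the formal languages block's actual code.

```python
import re

# Assumed lexem definition: a run of word characters, or a run of punctuation.
LEXEM = re.compile(r"\w+|[^\w\s]+")


def next_lexem_hint(target, matched_len):
    """Return target up to the end of the lexem that is unfinished or comes next."""
    for token in LEXEM.finditer(target):
        if token.end() > matched_len:
            return target[:token.end()]
    return target  # everything is already matched


TARGET = "are blue, white, red"
inside_word = next_lexem_hint(TARGET, len("are b"))
after_word = next_lexem_hint(TARGET, len("are blue"))
after_comma = next_lexem_hint(TARGET, len("are blue, "))
```

The three calls reproduce the three hints shown above: finishing "blue", adding the comma, and adding "white".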

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly facilitates thinking about the hint, since when the response is modified automatically it is too easy to repeatedly press '''hint''', which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
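
The choice between the two hint characters can be sketched as follows. The expression is again expanded by hand into full strings (an assumption made for illustration; a real engine walks its automaton instead), and the hint is taken from the completion that needs the fewest extra characters.

```python
def hint_character(candidates, matched):
    """Pick the next character from the candidate with the shortest remaining path."""
    completions = [c[len(matched):] for c in candidates
                   if c.startswith(matched) and len(c) > len(matched)]
    if not completions:
        return None
    return min(completions, key=len)[0]


# Hypothetical hand expansion of 'are blue, white(,| and) red'.
CANDIDATES = ["are blue, white, red", "are blue, white and red"]

# Partial match for the response 'are blue, white; red' ends before the ';'.
hint = hint_character(CANDIDATES, "are blue, white")
extra = len(" and red") - len(", red")  # how much longer the " and" path is
```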

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student response.
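
As a minimal sketch of this rule, assuming answers are stored as (regex, fraction) pairs with fractions between 0 and 1 (the data layout and function name are illustrative, not Preg's internal representation):

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, fraction in answers if fraction >= hint_grade_border]


ANSWERS = [
    ("are blue, white(,| and) red", 1.0),  # fully correct answer
    ("are white(,| and) red", 0.6),        # forgot 'blue'
]

strict = answers_used_for_hinting(ANSWERS, 1.0)
lenient = answers_used_for_hinting(ANSWERS, 0.5)
```

With the border at 1 only the full-grade expression can produce hints, so the student is never led towards the 60% answer.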

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in those circumstances - the placeholder syntax may become complex and prone to errors.
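
The substitution can be imitated with Python's <code>re</code> module. This is a sketch of the behaviour described above, not the question type's own code; <code>fill_placeholders</code> is a hypothetical helper.

```python
import re


def fill_placeholders(feedback, match):
    """Replace {$N} in feedback with the text captured by subpattern N."""
    def replace(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)    # unknown subpattern: leave as written
        return match.group(number) or ""   # unmatched subpattern: empty string
    return re.sub(r"\{\$(\d+)\}", replace, feedback)


NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

with_integral = fill_placeholders(FEEDBACK, NUMBER.search("123.34"))
without_integral = fill_placeholders(FEEDBACK, NUMBER.search(".5"))
```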

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
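
For comparison, here is how a different regex implementation reports some of the same mistakes. The sketch uses Python's <code>re</code> module, whose parser and messages are its own and are not those of Preg or PCRE; only patterns that Python also rejects are included.

```python
import re

# A few of the broken patterns from the list above.
BROKEN = ["abc)", "(?:qwerty", "+", "[a-fA-F\\d", "a{5,4}", "[z-a]"]


def describe_errors(patterns):
    """Map each pattern that fails to compile to the parser's error message."""
    report = {}
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error as error:
            report[pattern] = error.msg
    return report


report = describe_errors(BROKEN)
for pattern, message in report.items():
    print(f"{pattern!r}: {message}")
```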

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
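
The missing-word expression can be tried out in Python, whose <code>re</code> module handles this particular pattern the same way. The sample responses are made up for illustration.

```python
import re

# Succeeds only when the word 'necessary' appears nowhere in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")
# Same assertion without the anchor, to show why '^' is needed.
UNANCHORED = re.compile(r"(?!.*\bnecessary\b.*)")

word_present = MISSING_NECESSARY.search("it is necessary to test")
word_absent = MISSING_NECESSARY.search("it is optional to test")
unanchored_hit = UNANCHORED.search("it is necessary to test")
```

Without the anchor the expression still finds a position (inside or after the word) from which "necessary" no longer lies ahead, so it would report the word as missing even though it is present.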

You can also have a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?!<I\s+)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by first checking that "am" doesn't have "I" before it (the "(?!<I\s+)" part) and then that it doesn't have "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?!<^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by start of the string and not followed by "am".

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should be sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so it's no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who must do much on his main paid job and does not have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T19:44:02Z

Oasychev: /* Operands */

The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about the regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere); escape it as "\." if you need to match a single dot. It loses its special meaning inside a character class.
# '''Character classes''' match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
#* "[ab,!]" matches "a", "b", "," or "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Escape sequences''' for common character sets (could be used both inside or outside character classes):
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters, which can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands (have those meaning only outside character classes):
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
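
Several of these operands can be checked interactively. The sketch below uses Python's <code>re</code> module as a stand-in; it covers only the operands that Python shares with the Perl-compatible dialect (so no \Q...\E, unicode properties or POSIX classes), and the sample strings are made up.

```python
import re


def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


checks = {
    "escaped bar": matches_whole(r"\|", "|"),
    "escaped star and bracket": matches_whole(r"a\*b\[", "a*b["),
    "z is in [a-szC-F0-9]": matches_whole(r"[a-szC-F0-9]", "z"),
    "t is not in [a-szC-F0-9]": not matches_whole(r"[a-szC-F0-9]", "t"),
    "A is in [^a-z-]": matches_whole(r"[^a-z-]", "A"),
    "hyphen is not in [^a-z-]": not matches_whole(r"[^a-z-]", "-"),
    "word boundary found": re.search(r"\bcat\b", "a cat sat") is not None,
    "no boundary inside a word": re.search(r"\bcat\b", "concatenate") is None,
}
```

Every value in <code>checks</code> should come out as True.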

Still, a pattern that matches only one character isn't very useful. So here comes the '''operators''' that allow us to define an expression which matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
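
The precedence rules above can be verified with whole-string matching. Python's <code>re</code> module follows the same precedence for these operators, so it is used here as a convenient sketch; the test strings are chosen for illustration.

```python
import re


def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


precedence_checks = {
    # quantifier over concatenation
    "ab* repeats only b": whole(r"ab*", "abbb") and not whole(r"ab*", "abab"),
    "(ab)* repeats the string": whole(r"(ab)*", "abab"),
    # quantifier over alternative
    "ab|cd* repeats only d": whole(r"ab|cd*", "cddd") and not whole(r"ab|cd*", "abd"),
    "(ab|cd)* mixes alternatives": whole(r"(ab|cd)*", "ababcdabcdcd"),
    # repeated alternative versus alternative of repetitions
    "(a|b){2} accepts ab": whole(r"(a|b){2}", "ab"),
    "a{2}|b{2} rejects ab": not whole(r"a{2}|b{2}", "ab") and whole(r"a{2}|b{2}", "bb"),
}
```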

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numerated from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
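
Both backreference examples behave the same way in Python's <code>re</code> module, which makes for a quick sketch. The second declaration string ("int a; b = 1;") is an extra made-up case showing what the backreference rejects.

```python
import re

# "\1" must repeat exactly what subpattern 1 captured.
SAME_LETTER = re.compile(r"([ab])\1")
# Declaration and initialization of the same C variable, with flexible spacing.
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

doubled = [s for s in ("aa", "bb", "ab", "ba") if SAME_LETTER.fullmatch(s)]

same_variable = DECLARATION.fullmatch(" int var2 ; var2=123 ; ")
variable_name = same_variable.group(1)
other_variable = DECLARATION.fullmatch("int a; b = 1;")
```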

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numeration. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches any number of "a" ending with "b" without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches any number of "a" that is not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches any number of "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches any number of "a" that is not preceded by "b".
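
A small sketch of the four assertions using Python's <code>re</code> module, which shares this syntax. The sample strings "aaab" and "baaa" are made up to show what ends up inside the match.

```python
import re

# Lookahead examples on a run of 'a' followed by 'b'.
positive_ahead = re.search(r"a+(?=b)", "aaab").group()
negative_ahead = re.search(r"a+(?!b)", "aaab").group()

# Lookbehind examples on a run of 'a' preceded by 'b'.
positive_behind = re.search(r"(?<=b)a+", "baaa").group()
negative_behind = re.search(r"(?<!b)a+", "baaa").group()
```

In the negative cases the match is one "a" shorter: the engine gives up the "a" that touches the forbidden "b" and matches the rest.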

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
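
The three anchoring variants can be compared on a few made-up responses. Python's <code>re.search</code> is used as a sketch of unanchored matching; the anchors themselves behave as described above.

```python
import re


def found(pattern, response):
    """True if the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None


anchoring_checks = {
    "^ab|cd$ accepts ab at the start": found(r"^ab|cd$", "abxx"),
    "^ab|cd$ accepts cd at the end": found(r"^ab|cd$", "xxcd"),
    "^ab|cd$ rejects ab in the middle": not found(r"^ab|cd$", "xabx"),
    "^(ab|cd)$ rejects extra text": not found(r"^(ab|cd)$", "abxx"),
    "^(ab|cd)$ accepts exact cd": found(r"^(ab|cd)$", "cd"),
    "^ab$|^cd$ accepts exact ab": found(r"^ab$|^cd$", "ab"),
    "^ab$|^cd$ rejects abcd": not found(r"^ab$|^cd$", "abcd"),
}
```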

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This affects general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
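
The scope of a local modifier can be imitated in Python, with one caveat: Python's <code>re</code> module does not accept a bare "(?i)" in the middle of a pattern, so this sketch substitutes the scoped form "(?i:...)" wrapped around the part shown in bold above. That substitution is an assumption made for illustration and is not the syntax you would type into the question.

```python
import re

# Equivalent of "abc(de(?i)gh)xyz": only "gh" ignores case.
INNER = re.compile(r"abc(de(?i:gh))xyz")
# Equivalent of "abc(de)(?i)ghxyz": everything after the group ignores case.
TOP_LEVEL = re.compile(r"abc(de)(?i:ghxyz)")

inner_ok = INNER.fullmatch("abcdeGHxyz") is not None
inner_rejects_other_case = INNER.fullmatch("abcDEghxyz") is None

top_ok = TOP_LEVEL.fullmatch("abcdeGhXyZ") is not None
top_rejects_other_case = TOP_LEVEL.fullmatch("ABCdeghxyz") is None
```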

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers are written. The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button; pressing it shows a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of a language. For natural language a ''word'', a ''number'', a ''punctuation mark'' (or group of marks like '?!' or '...') are lexemes. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Preg question type, since the 2.3 release, allows usage of next lexem hinting using the ''formal languages block''. You should choose the language you want to use, since lexem borders are different for different languages. For now it supports only two languages (but there will be more):
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly facilitates thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

It is possible that not all regular expressions will give 100% grade. Consider you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
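
The selection rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the question type's actual code; the function name <code>answers_used_for_hinting</code> and the representation of answers as (regex, grade) pairs are assumptions.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

# Hypothetical answers: (regular expression, grade as a fraction of 1).
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

print(answers_used_for_hinting(answers, 1))    # only the 100% answer
print(answers_used_for_hinting(answers, 0.5))  # both answers
```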

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
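
The placeholder substitution can be imitated with a short Python sketch. It uses Python's <code>re</code> module rather than the Preg engines, and the function name <code>fill_feedback</code> is hypothetical; it only mirrors the behaviour described above (an unmatched subpattern becomes an empty string).

```python
import re

def fill_feedback(regex, response, feedback):
    """Replace {$n} placeholders in feedback with the captured subpatterns."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        number = int(placeholder.group(1))
        if number > match.re.groups:
            return placeholder.group(0)   # no such subpattern: leave untouched
        return match.group(number) or ""  # unmatched subpattern: empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```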

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
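
To get a feel for parser-level error reporting, the following Python sketch feeds some of the patterns from the list above to Python's own regex parser. That parser is a different one from the Preg parser, so the set of rejected patterns and the messages are not the same; the sketch only includes patterns that Python's <code>re</code> rejects as well. The names <code>SUSPECT_PATTERNS</code> and <code>check</code> are hypothetical.

```python
import re

# Patterns from the list above that Python's re module also refuses to compile.
SUSPECT_PATTERNS = [
    r"abc)",        # unopened closing parenthesis
    r"(?:qwerty",   # unclosed opening parenthesis
    r"+",           # quantifier without an operand
    r"a(+)",        # the same inside a subexpression
    r"[a-fA-F\d",   # unclosed character class
    r"a{5,4}",      # incorrect quantifier range
    r"[z-a]",       # incorrect character range
]

def check(pattern):
    """Return None if the pattern compiles, otherwise the parser's message."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None

for pattern in SUSPECT_PATTERNS:
    print(pattern, "->", check(pattern))
```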

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
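
A quick way to check such a "missing word" expression is to run it against a few responses. The sketch below does this with Python's <code>re</code> module (used here as an assumption in place of the PHP preg engine); the helper name <code>word_is_missing</code> is hypothetical.

```python
import re

# Succeeds (with an empty match at the start) only if "necessary" never occurs.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """Return True if the response lacks the whole word "necessary"."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("it is required"))
print(word_is_missing("it is necessary here"))
print(word_is_missing("unnecessary work"))  # "\b" requires a whole word
```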

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and doesn't have "victor" after it (the "(?!\s+victor)" part). In the lookahead, "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by start of the string and not followed by "am".
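
The "misplaced word" idea can be tried out in Python as well. The sketch writes the lookbehind as "(?<!I\s)" with a single "\s", because Python's <code>re</code> (like PCRE) only accepts lookbehind assertions of fixed length; the helper name <code>am_is_misplaced</code> is hypothetical.

```python
import re

# "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")

def am_is_misplaced(response):
    """Return True if some "am" in the response is out of place."""
    return MISPLACED_AM.search(response) is not None

print(am_is_misplaced("I am victor"))
print(am_is_misplaced("am I victor"))
print(am_is_misplaced("I victor am"))
```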

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers don't support, for now, the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so it's no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T19:38:34Z

Oasychev: /* Next lexem hinting */

The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about the regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
The regular expressions - as any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character sets:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
All characters and escape sequences work both inside and outside character sets, but POSIX classes are allowed only inside character sets [...]. Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.
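
Several of the operands listed above can be tried out with a short Python sketch. Python's <code>re</code> module is used here as an assumption in place of the Preg engines; its syntax agrees with PCRE for these simple operands, but it does not support "\Q...\E", "\p{xx}" or POSIX classes, so those are left out. The helper name <code>full</code> is hypothetical.

```python
import re

def full(regex, text):
    """Return True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

print(full(r"a\*b\[", "a*b["))      # escaped special characters
print(full(r"[a-szC-F0-9]", "z"))   # single character inside a set
print(full(r"[a-szC-F0-9]", "t"))   # "t" lies outside every range
print(full(r"[^a-z-]", "-"))        # the negative set excludes "-" too
print(full(r"\w\d\s\.", "x7 ."))    # escape sequences and an escaped dot
print(re.findall(r"\bcat\b", "cat concat cat."))  # word boundaries
```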

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - an ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
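
The precedence rules above can be verified with a small Python sketch (Python's <code>re</code> module is an assumption standing in for the Preg engines; the names <code>full</code> and <code>CASES</code> are hypothetical). Each case checks whether the regex matches the whole text.

```python
import re

def full(regex, text):
    """Return True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None

# (regex, text) pairs chosen to show how precedence changes the match.
CASES = [
    (r"ab*", "abbb"),               # the quantifier repeats only "b"
    (r"ab*", "abab"),               # so "ab" itself is not repeated
    (r"(ab)*", "abab"),             # parentheses repeat the whole "ab"
    (r"ab|cd*", "cddd"),            # "ab", or "c" followed by "d"s
    (r"(ab|cd)*", "ababcdabcdcd"),  # repeated alternative, any order
    (r"(a|b){2}", "ab"),            # each repetition chooses again
    (r"a{2}|b{2}", "ab"),           # only "aa" or "bb" are accepted
]

for regex, text in CASES:
    print(regex, text, full(regex, text))
```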

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
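
The declaration regex can be checked with a few lines of Python (again using Python's <code>re</code> module as a stand-in, which is an assumption; the function name <code>declared_and_initialized</code> is hypothetical). The backreference "\1" forces the initialized variable to be the declared one.

```python
import re

DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

def declared_and_initialized(code):
    """Return the variable name if the same variable is declared and set."""
    match = DECLARATION.fullmatch(code)
    return match.group(1) if match else None

print(declared_and_initialized(" int var2 ; var2=123 ; "))
print(declared_and_initialized("int _var; _var = -10;"))
print(declared_and_initialized("int a; b = 1;"))  # different names
```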

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
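
As a small illustration of named subpatterns, here is a Python sketch. Of the forms listed above, Python's <code>re</code> module (used here as an assumption in place of the Preg engines) accepts "(?P<name>...)" for the subpattern and "(?P=name)" for the backreference, so the sketch uses those two; the subpattern name <code>letter</code> and the function name are hypothetical.

```python
import re

# The named version of "([ab])\1": the same letter twice.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

def doubled_letter(text):
    """Return the repeated letter, or None if the text is not a double."""
    match = DOUBLED.fullmatch(text)
    return match.group("letter") if match else None

print(doubled_letter("aa"))
print(doubled_letter("ab"))
```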

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches any number of "a" ending with "b" without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches any number of "a" that is not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches any number of "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches any number of "a" that is not preceded by "b".
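
The four assertions can be compared side by side with a Python sketch (Python's <code>re</code> module is an assumption standing in for the PHP preg engine; the helper name <code>first_match</code> is hypothetical). Note how the negative assertions shorten or shift the match instead of making it fail.

```python
import re

def first_match(regex, text):
    """Return the text of the first match, or None if there is none."""
    match = re.search(regex, text)
    return match.group(0) if match else None

print(first_match(r"a+(?=b)", "aaab"))   # "b" is required but not included
print(first_match(r"a+(?!b)", "aaab"))   # backs off so that "b" does not follow
print(first_match(r"(?<=b)a+", "baaa"))  # must start right after "b"
print(first_match(r"(?<!b)a+", "baaa"))  # skips the "a" that follows "b"
```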

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will use all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and give the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
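
The difference between these three expressions shows up as soon as the response contains extra text. A minimal Python sketch (using Python's <code>re</code> module as an assumption in place of the Preg engines; the helper name <code>found</code> is hypothetical):

```python
import re

def found(regex, response):
    """Return True if the regex matches somewhere in the response."""
    return re.search(regex, response) is not None

print(found(r"^ab|cd$", "abxyz"))     # "^" anchors only the "ab" alternative
print(found(r"^ab|cd$", "xyzcd"))     # "$" anchors only the "cd" alternative
print(found(r"^(ab|cd)$", "abxyz"))   # both anchors apply to both alternatives
print(found(r"^(ab|cd)$", "cd"))
print(found(r"^ab$|^cd$", "abcd"))    # every alternative is anchored
```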

If you set the '''exact matching''' options to "yes" (which is the default value), the question will add ^ and $ in each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity setting, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and add a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
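
The scope of a local modifier can be illustrated with a small Python sketch. This is an emulation under an assumption: Python's "re" module does not accept "(?i)" in the middle of a pattern, so the sketch writes the affected part with the scoped form "(?i:...)", which covers exactly the text the two examples above mark in bold. The helper "accepts" is a hypothetical name.

```python
import re

# "abc(de(?i)gh)xyz": the modifier sits inside the parentheses,
# so it stops at the closest ")" and covers only "gh".
inside_group = re.compile(r"abc(de(?i:gh))xyz")

# "abc(de)(?i)ghxyz": the modifier sits on the top level,
# so it runs to the end of the expression and covers "ghxyz".
top_level = re.compile(r"abc(de)(?i:ghxyz)")

def accepts(regex, response):
    """True if the whole response matches the compiled regex."""
    return regex.fullmatch(response) is not None
```

With these patterns "abcdeGHxyz" is accepted by both, while "abcdeghXYZ" is accepted only by the second one.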

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer question type.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. The default one is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, with access to hinting and without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student will have a '''hint next character''' button. Pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of the language. For a natural language, a ''word'', a ''number'', a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

'''Next lexem hint''' will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). For example:
are blue
or
are blue,
or
are blue, white

Since version 2.3, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language you use, since lexem borders are different for different languages. For now it supports only two languages, but there will be more:
* '''simple english''' - a simple lexer that recognizes words, numbers and punctuation;
* '''C/C++ language''' - a programming language C (or C++).

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You can enter another word in the question description.

=====General hinting rules=====
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not a desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
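
The choice described above can be sketched in Python. This is only a brute-force illustration and not how the NFA or DFA engines work: it assumes the two strings accepted by "are blue, white(,| and) red" have been written out by hand, and the names "ANSWERS", "common_prefix_len" and "best_partial_match" are hypothetical.

```python
# The two strings accepted by "are blue, white(,| and) red", listed by hand.
ANSWERS = ["are blue, white, red", "are blue, white and red"]

def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def best_partial_match(response, answers=ANSWERS):
    """Return (matched text, next character hint, characters left to type).

    The match may start anywhere in the response (unanchored). Among all
    candidates the one that is shortest to complete wins; ties go to the
    longer matched part. Returns None when nothing matches at all.
    """
    best = None
    for answer in answers:
        for start in range(len(response)):
            matched = common_prefix_len(response[start:], answer)
            if matched == 0:
                continue
            left = len(answer) - matched
            candidate = (left, -matched,
                         response[start:start + matched],
                         answer[matched:matched + 1])
            if best is None or candidate < best:
                best = candidate
    if best is None:
        return None
    left, _, matched_text, next_char = best
    return matched_text, next_char, left
```

For the response "are blue, white; red" the sketch returns the hint "," with 5 characters left, because the " and" path would need 8.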

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
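
The filtering rule can be written down as a short Python sketch. The list of (regex, grade) pairs and the function name "answers_used_for_hinting" are assumptions made for the illustration; grades are expressed as fractions from 0 to 1.

```python
# (regular expression, grade as a fraction) pairs from the example above.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]
```

With a border of 1 only the first expression takes part in hinting, so a student who forgot ''blue'' is never led towards the 60% answer.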

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second captures the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
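
The placeholder substitution can be imitated with a few lines of Python. This is a sketch of the behaviour described above, not the question type's own code; "fill_feedback" is a hypothetical helper, and leaving placeholders with a too large number untouched is a choice made for the sketch.

```python
import re

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

def fill_feedback(regex, feedback, response):
    """Replace {$n} placeholders with the n-th captured subpattern."""
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)   # no such subpattern: leave as is
        return match.group(index) or ""  # unmatched subpattern: empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)
```

Calling it with the response ".5" shows the limitation mentioned above: the words "Integral part is" stay in the text even though {$1} is empty.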

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
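
Several of these constructs can be tried quickly in Python, whose "re" module happens to reject them as well. This is only a convenient playground: the set of rejected constructs and the messages differ from both PCRE and the Preg parser, and "compile_error" is a hypothetical helper.

```python
import re

# A few of the erroneous patterns from the list above.
SUSPECT = ["abc)", "(?:qwerty", "+", "a(+)", r"[a-fA-F\d", "a{5,4}", "[z-a]"]

def compile_error(pattern):
    """Return the error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Maps each suspect pattern to the message Python reports for it.
report = {pattern: compile_error(pattern) for pattern in SUSPECT}
```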

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
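
The missing-word regex can be tried out in Python, which handles this lookahead the same way. The function name "word_is_missing" and the sample responses are assumptions made for the sketch.

```python
import re

# Matches (with an empty match at position 0) only when the whole word
# "necessary" does not occur anywhere in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True if the response lacks the word "necessary"."""
    return MISSING_NECESSARY.search(response) is not None
```

Because of the "\b" assertions, a response containing only "unnecessary" still counts as missing the word.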

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part, a negative lookbehind) and doesn't have "victor" after it (the "(?!\s+victor)" part, a negative lookahead). "\s+" allows any number of spaces between words; PCRE requires lookbehind assertions to have a fixed length, so only a single "\s" can be used there. If you want to catch the first (last) word (punctuation mark, etc.), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
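
A sketch of the two misplaced-word checks in Python, written with negative lookbehind "(?<!...)" and negative lookahead "(?!...)". Python, like PCRE, accepts only lookbehind of a fixed length, so the sketch uses a single "\s" before the word; the constant names and the helper "is_misplaced" are hypothetical.

```python
import re

# "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")

# "I" that is neither at the start of the string nor followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def is_misplaced(regex, response):
    """True if the regex finds its word in a wrong position."""
    return regex.search(response) is not None
```

For the correct sentence "I am victor" neither check fires, while "victor am I" triggers both.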

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

The NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't want to generate hints for these answers anyway, as they don't define a correct situation, so it's not really a restriction.

====Matching engines====
Matching engines are different pieces of program code that perform matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PCRE operands and operators are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but the completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T19:36:28Z

Oasychev: /* Hinting */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy, if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands, i.e. preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\", other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and a complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character sets:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
All characters and escape sequences work both inside and outside character sets, but POSIX classes are allowed only inside character sets [...]. Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.
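
Most of the operands listed above behave the same way in Python's "re" module, so they can be tried out there. The sketch below makes two assumptions worth noting: Python has no "\Q ... \E" sequence, so "re.escape" is used as the closest equivalent, and "matches_whole" is a hypothetical helper that requires the pattern to cover the entire string.

```python
import re

def matches_whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Escaped special characters match themselves.
escaped = matches_whole(r"a\*b\[", "a*b[")

# re.escape quotes every special character, much like \Q ... \E does.
literal_then_any = re.escape("^(abc)$") + "."

# Ranges mixed with a single character, and a negative set with a literal "-".
mixed_set = "[a-szC-F0-9]"
negative_set = "[^a-z-]"

def has_word_am(text):
    """\\b asserts a word boundary without consuming any characters."""
    return re.search(r"\bam\b", text) is not None
```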

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
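
The precedence rules above can be verified with a short Python sketch; for these patterns Python's "re" module follows the same rules. "is_exact" is a hypothetical helper that checks whether the whole string matches.

```python
import re

def is_exact(pattern, text):
    """True if the pattern matches the whole text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
single_char_repeat = is_exact("ab*", "abbb") and not is_exact("ab*", "abab")
group_repeat = is_exact("(ab)*", "abab")

# Quantifier over alternative: in "ab|cd*" the "*" applies to "d" only.
alt_without_group = is_exact("ab|cd*", "cddd") and not is_exact("ab|cd*", "abd")
alt_with_group = is_exact("(ab|cd)*", "ababcdabcdcd")

# A repeated alternative may pick a different branch on each repetition.
mixed_pair = is_exact("(a|b){2}", "ab") and not is_exact("a{2}|b{2}", "ab")
```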

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
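
Both backreference examples can be tried in Python, which uses the same "\1" syntax. The constant names and the helper "declared_variable" are assumptions made for this sketch.

```python
import re

# The whitespace-tolerant C declaration regex from the text above.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

# "\1" must repeat exactly what the 1st subpattern captured.
DOUBLED = re.compile(r"([ab])\1")

def declared_variable(code):
    """Return the variable name if it is declared and then initialized."""
    match = DECLARATION.fullmatch(code)
    return match.group(1) if match else None
```

A response that declares one variable and assigns to another, such as "int a; b = 1;", is rejected because the backreference fails.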

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
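
A small Python sketch shows why names are convenient. Python's "re" module supports only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the sketch is limited to those; the pattern and the names in it are made up for the illustration.

```python
import re

# A named subpattern and a backreference to it by name.
NAMED = re.compile(r"(?P<letter>[ab])(?P=letter)")

# Adding another subpattern in front changes the numbers (the letter is
# now subpattern 2), but the name keeps pointing at the same part.
WITH_PREFIX = re.compile(r"(x)?(?P<letter>[ab])(?P=letter)")

def captured_letter(regex, text):
    """Return what the subpattern named "letter" captured, or None."""
    match = regex.fullmatch(text)
    return match.group("letter") if match else None
```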

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
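
The four assertions can be observed in Python, which uses the same syntax for them. The helper "first_match" and the sample strings are assumptions made for this sketch; it returns the text of the first match so that it is visible that the asserted "b" never becomes part of it.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Lookahead: "b" is checked but never consumed.
ahead_positive = first_match(r"a+(?=b)", "caaab")
# The engine gives back one "a" so that the match is not followed by "b".
ahead_negative = first_match(r"a+(?!b)", "aaab")

# Lookbehind: the character before the match is checked.
behind_positive = first_match(r"(?<=b)a+", "aa baaa")
# The "a" right after "b" is skipped, the match starts one character later.
behind_negative = first_match(r"(?<!b)a+", "baaa")
```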

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question: "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one regex with a 100% grade) to be shown to the student as the '''correct answer'''. The question tries the regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (it is used for displaying a hint and gives a zero grade) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers are written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red

=====Next character hinting=====
When next character hinting is available, the student has a '''hint next character''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

=====Next lexem hinting=====
'''Lexem''' means an atomic part of the language. For a natural language, a ''word'', a ''number'', a ''punctuation mark'' (or a group of marks like '?!' or '...') are lexems. For a programming language it's a ''keyword'', a ''variable name'', a ''constant'', an ''operator''. Note that spaces are usually not considered to be lexems, but separators between them, since they don't have any particular meaning.

The next lexem hint will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has just completed the current lexem). Like
are blue
or
are blue,
or
are blue, white

Since version 2.3, the Preg question type allows next lexem hinting using the ''formal languages block''. You should choose the language you use, since lexem borders are different for different languages. For now it supports only two languages, but there will be more:
* simple english - a simple lexer that recognizes words, numbers and punctuation;
* C/C++ language - a programming language C (or C++).
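To make the idea of lexem borders concrete, here is a rough sketch in Python. It is not the lexer from the formal languages block: the lexem definition (<code>LEXEM</code>) and the function <code>next_lexem_hint</code> are assumptions made for this illustration, and the hint is computed against one fixed correct answer rather than against a regular expression.

```python
import re

# Assumed lexem definition for plain English text: a run of word characters,
# or a run of punctuation marks such as "?!" or "...". Spaces only separate lexems.
LEXEM = re.compile(r"\w+|[^\w\s]+")

def next_lexem_hint(correct_answer, matched_length):
    """Return the correct answer up to the end of the lexem in which the
    partial match stops, or of the next lexem if it stops between lexems."""
    for lexem in LEXEM.finditer(correct_answer):
        if lexem.end() > matched_length:
            return correct_answer[:lexem.end()]
    return correct_answer  # the answer is already complete, nothing to hint

answer = "are blue, white, red"
print(next_lexem_hint(answer, len("are bl")))      # partial match ends inside "blue"
print(next_lexem_hint(answer, len("are blue")))    # "blue" is complete, hint the comma
print(next_lexem_hint(answer, len("are blue, ")))  # hint the whole next word
```

The three calls reproduce the three hints listed above for a student who has typed "are bl", "are blue" and "are blue, " respectively.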

Note that "lexem" typically isn't a word you would like your students to see on the hinting button. You could enter another

=====General hinting rules=====
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing it separately instead for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' ' (leading to the " and" path). The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
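The choice between the two hint characters can be sketched in a few lines of Python. This is only an illustration under strong assumptions: a real engine walks its automaton, while here the regular expression is expanded by hand into the two strings it can match (<code>TARGETS</code>), and the function name <code>partial_match</code> is hypothetical.

```python
# The regex "are blue, white(,| and) red" can match exactly these two strings.
TARGETS = ["are blue, white, red", "are blue, white and red"]

def partial_match(response, targets):
    """Return (matched_text, next_character) for the partial match that
    needs the fewest extra characters to become a full match."""
    best = None
    for target in targets:
        # The regex is unanchored, so the match may start anywhere in the response.
        for start in range(len(response) + 1):
            length = 0
            while (length < len(target)
                   and start + length < len(response)
                   and response[start + length] == target[length]):
                length += 1
            remaining = len(target) - length
            if best is None or remaining < best[0]:
                matched = response[start:start + length]
                best = (remaining, matched, target[length:length + 1])
    return best[1], best[2]

print(partial_match("they are blue, vhite and red", TARGETS))
print(partial_match("are blue, white; red", TARGETS))
```

For the second response both targets share the matched part "are blue, white", but completing ", red" takes 5 characters while completing " and red" takes 8, so the comma is hinted.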

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
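The rule is a plain threshold filter. A tiny sketch in Python, assuming answers are kept as (regex, grade) pairs with grades between 0 and 1; the function name <code>hinting_regexes</code> is invented for the example:

```python
def hinting_regexes(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

answers = [
    (r"are blue, white(,| and) red", 1.0),  # fully correct answer
    (r"are white(,| and) red", 0.6),        # the student forgot "blue"
]

print(hinting_regexes(answers, 1))    # only the 100% regex is used for hints
print(hinting_regexes(answers, 0.5))  # both regexes are used for hints
```

With a border of 1 the 60% answer still gives its grade and feedback on a full match, but it can no longer steer hints towards "are white, red".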

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all the repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax would become complex and prone to errors.
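The substitution can be imitated in Python to see what the student would read. This is a sketch of the idea only, not the question type's own code: <code>fill_placeholders</code> is a hypothetical helper, and leaving unknown placeholder numbers untouched is an assumption of this example.

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

def fill_placeholders(feedback, match):
    """Replace {$n} with the text captured by subpattern n of the match."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # assumption: unknown numbers stay as they are
        return match.group(index) or ""  # a subpattern that did not match gives ""
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

print(fill_placeholders(FEEDBACK, NUMBER.search("123.34")))
print(fill_placeholders(FEEDBACK, NUMBER.search(".5")))
```

The second call shows the limitation described above: the words "Integral part is" remain in the feedback even though nothing was captured for {$1}.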

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
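For comparison, a parser that reports such problems instead of silently accepting them can be tried in Python, whose <code>re</code> module rejects several of the patterns listed above. This is only an analogy: the messages come from Python, not from the Preg parser, Python does not necessarily reject every class in the list, and <code>regex_error</code> is a helper invented for this sketch.

```python
import re

def regex_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None

# A few of the examples from the list above.
for pattern in ["abc)", "(?:qwerty", "+", "a(+)", "[a-fA-F\\d", "a{5,4}", "[z-a]"]:
    print(repr(pattern), "->", regex_error(pattern))
```

Each of these patterns yields a message that names the problem, which is the kind of feedback a teacher needs while writing an answer.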

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that will work when something is absent from the student's response (and give appropriate feedback to the student).

Similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult to follow, just surround the regex for the thing that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!

You can also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part, a negative lookbehind) and doesn't have "victor" after it (the "(?!\s+victor)" part, a negative lookahead). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE does not accept an unbounded repetition such as "\s+" inside a lookbehind assertion. If you want to catch the first (last) word (punctuation mark, etc), you should place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for "I" that is not preceded by start of the string and not followed by "am".
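Both kinds of check can be tried out in Python, whose <code>re</code> module understands the same lookahead and lookbehind syntax. The sketch below is not the question type's code; the constant names and the helper <code>fires</code> are made up. Note that a negative lookbehind is written "(?<!...)" and must have a fixed width in Python as well, so a single "\s" is used inside it.

```python
import re

# Fires only when the word "necessary" is absent from the whole response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

# Fires on an "am" that is neither preceded by "I " nor followed by " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")

# Fires on an "I" that is neither at the start of the string nor followed by " am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def fires(pattern, response):
    """Return True if the error-catching pattern matches the response."""
    return pattern.search(response) is not None

print(fires(MISSING_NECESSARY, "it is not needed"))  # the word is missing
print(fires(MISSING_NECESSARY, "it is necessary"))   # the word is present
print(fires(MISPLACED_AM, "am I victor"))            # "am" is in the wrong place
print(fires(MISPLACED_I, "I am victor"))             # everything is in place
```

As the text says, the misplaced-word search is rough: for "victor I am" neither pattern fires, because "am" still directly follows "I".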

Note, that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers for now don't support the complex assertions used by these regexes. Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints for these answers, as they don't define a correct situation, so it's no real restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP regex operands and operators are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has much to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for those features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T19:04:26Z

Oasychev: /* Non-deterministing finite state automata(NFA) */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
The regular expressions - as any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character sets:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
All characters and escape sequences work both inside and outside character sets, but POSIX classes are allowed only inside character sets [...]. Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.
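A few of the operands above can be checked with a short Python sketch. Python's <code>re</code> module is used only as a convenient stand-in: it agrees with PCRE on these character sets and on "\b", but it has no "\Q...\E", so the sketch uses <code>re.escape</code> as the closest equivalent. The helper name <code>is_single_match</code> is hypothetical.

```python
import re

def is_single_match(pattern, character):
    """Return True if the pattern matches exactly this one character."""
    return re.fullmatch(pattern, character) is not None

# Ranges "a-s", "C-F", "0-9" plus the single character "z".
accepted = [c for c in "asztCG5" if is_single_match(r"[a-szC-F0-9]", c)]
print(accepted)

# Negative set: anything except "a".."z" and "-".
print(is_single_match(r"[^a-z-]", "A"), is_single_match(r"[^a-z-]", "-"))

# Escaping inside a set: "-", "]" and "\" are matched literally.
print(all(is_single_match(r"[\-\]\\]", c) for c in "-]\\"))

# Stand-in for "\Q^(abc)$\E.": escape the literal part, then add the dot.
literal_then_any = re.escape("^(abc)$") + "."
print(re.fullmatch(literal_then_any, "^(abc)$!") is not None)

# "\b" does not consume characters, it only tests for a word boundary.
print(re.search(r"\bam\b", "I am") is not None, re.search(r"\bam\b", "spam") is not None)
```

Letting a function such as <code>re.escape</code> do the escaping follows the same advice as the note above: when unsure, escape every character that is not a letter or a digit.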

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what this means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses lets us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
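The precedence rules above can be verified mechanically. The following Python sketch checks each example against the whole string with <code>re.fullmatch</code>; the helper <code>full</code> and the table <code>checks</code> are names chosen for this illustration only.

```python
import re

def full(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result) taken from the rules above.
checks = [
    (r"ab*", "abbb", True),        # the quantifier applies to "b" only
    (r"ab*", "abab", False),
    (r"(ab)*", "abab", True),      # parentheses make "ab" the repeated unit
    (r"ab|cd*", "cddd", True),     # the alternatives are "ab" and "cd*"
    (r"ab|cd*", "abab", False),
    (r"(ab|cd)*", "ababcdabcdcd", True),
    (r"(a|b){2}", "ab", True),     # the whole alternative is repeated
    (r"a{2}|b{2}", "ab", False),   # each letter is repeated separately
    (r"a{2}|b{2}", "bb", True),
]

for pattern, text, expected in checks:
    assert full(pattern, text) == expected, (pattern, text)
print("all precedence examples behave as described")
```

Writing such a table before entering a regex as an answer is a cheap way to catch a missing pair of parentheses.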

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with 0 standing for the whole regex. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
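Backreferences behave the same way in Python's <code>re</code> module, so the two examples above can be tried directly. This is just a sketch for experimenting; the constant names are invented.

```python
import re

# "\1" must repeat exactly what subpattern 1 captured.
DOUBLED = re.compile(r"([ab])\1")
print([text for text in ["aa", "bb", "ab", "ba"] if DOUBLED.fullmatch(text)])

# Declaration followed by initialization of the same variable.
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

good = DECLARATION.fullmatch(" int var2 ; var2=123 ; ")
print(good.group(1))  # the captured variable name

# A different variable is initialized, so the backreference fails.
print(DECLARATION.fullmatch("int a; b = 1;"))
```

The last line prints None: the regex checks that the initialized variable is the declared one, which is something a fixed list of accepted answers could not do.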

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
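Named subpatterns can be tried in Python too, with one caveat: its <code>re</code> module accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above, so the sketch is limited to those. The name <code>letter</code> is arbitrary.

```python
import re

# The same "doubled letter" regex as "([ab])\1", but with a named subpattern.
DOUBLED_NAMED = re.compile(r"(?P<letter>[ab])(?P=letter)")

match = DOUBLED_NAMED.fullmatch("bb")
# A named subpattern still has its number, so both lookups return the same text.
print(match.group("letter"), match.group(1))
print(DOUBLED_NAMED.fullmatch("ab"))
```

If another pair of parentheses is later added before "(?P<letter>...)", the number changes but the name keeps working.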

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
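What each assertion leaves in the match is easiest to see on concrete strings. A short Python sketch (Python's <code>re</code> uses the same syntax for these four assertions; the helper <code>found</code> is made up):

```python
import re

def found(pattern, text):
    """Return the matched text, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group() if match else None

print(found(r"a+(?=b)", "aaab"))   # "b" is required but not included
print(found(r"a+(?!b)", "aaab"))   # the last "a" is given back, it is followed by "b"
print(found(r"(?<=b)a+", "baaa"))  # "b" is required before, but not included
print(found(r"(?<!b)a+", "baaa"))  # the first "a" is skipped, it is preceded by "b"
```

The outputs are "aaa", "aa", "aaa" and "aa": the assertions change where a match may start or end without ever adding characters to it.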

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (the question adds the modifiers for you: "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question tries the regular expressions in order until it finds the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one would be the shortest to complete.
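The "first full match wins" rule can be sketched as a simple loop. The Python below is an illustration under assumptions, not the question type's implementation: answers are modelled as (regex, grade) pairs, <code>grade_response</code> is a hypothetical name, and Python's <code>re.IGNORECASE</code> flag plays the role of the "i" modifier (Python 3 string patterns are Unicode-aware by default, roughly like "u").

```python
import re

def grade_response(answers, response, case_sensitive=True):
    """Return the grade of the first regex, in the teacher's order,
    that has a full match somewhere in the response; 0.0 if none has."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for regex, grade in answers:
        if re.search(regex, response, flags):
            return grade
    return 0.0

answers = [
    (r"^are blue, white(,| and) red$", 1.0),  # manually anchored grading regex
    (r"white", 0.3),                          # unanchored regex catching a partial idea
]

print(grade_response(answers, "are blue, white and red"))
print(grade_response(answers, "they are white"))
print(grade_response(answers, "ARE BLUE, WHITE, RED", case_sensitive=False))
```

Because the order matters, stricter regexes with higher grades are best placed before looser ones.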

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
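The effect of anchoring and of its interaction with alternatives can be checked with a small Python sketch. The helper <code>add_exact_anchors</code> is hypothetical: it only illustrates one way anchors could be added without changing subpattern numbers (a non-capturing group), and is not necessarily how the question type does it.

```python
import re

def matches(pattern, response):
    """True if the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

def add_exact_anchors(pattern):
    # Hypothetical helper: the non-capturing group keeps the anchors outside
    # every top-level alternative and leaves subpattern numbers unchanged.
    return "^(?:" + pattern + ")$"

print(matches("^ab|cd$", "abxyz"))       # "ab" at the start is enough
print(matches("^(ab|cd)$", "abxyz"))     # the whole response must be "ab" or "cd"
print(matches(add_exact_anchors("ab|cd"), "cd"))
```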

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen on the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks off at some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student will see a '''hint''' button; pressing it shows a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the matching engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
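The choice between ',' and ' ' can be illustrated with a deliberately simplified Python sketch. It assumes the regex has been expanded by hand into the full strings it can match and that matching starts at the beginning of the response; a real engine works on its automaton instead. The names <code>common_prefix_len</code> and <code>next_hint_char</code> are made up for this illustration.

```python
def common_prefix_len(a, b):
    """Number of leading characters shared by a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def next_hint_char(response, full_answers):
    """Pick the next character on the shortest path to a full answer."""
    best = None  # (characters left to type, hint character)
    for answer in full_answers:
        matched = common_prefix_len(response, answer)
        remaining = answer[matched:]
        if not remaining:
            return None  # the response already starts with a full answer
        if best is None or len(remaining) < best[0]:
            best = (len(remaining), remaining[0])
    return best[1] if best else None

# Hand-written expansion of "are blue, white(,| and) red" (an assumption).
answers = ["are blue, white, red", "are blue, white and red"]

print(repr(next_hint_char("are blue, white; red", answers)))
```

Here ", red" needs 5 more characters and " and red" needs 8, so the comma is hinted.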

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
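As a minimal sketch of this rule, the selection of answers eligible for hinting is just a filter on the grade. The function name and the data layout are assumptions made for the example, not the question type's internals.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [pattern for pattern, grade in answers if grade >= hint_grade_border]

# Hypothetical (regex, grade) pairs for one question.
answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
    ("are red", 0.2),
]

print(answers_used_for_hinting(answers, 1))
print(answers_used_for_hinting(answers, 0.5))
```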

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. More precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.
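The placeholder substitution described above can be imitated in a few lines of Python. This is only a sketch of the idea under the stated rules ({$0} is the whole match, {$n} the n-th subpattern, missing captures become empty strings); <code>fill_placeholders</code> is a made-up name, not a function of the question type.

```python
import re

def fill_placeholders(feedback, match):
    """Replace {$n} in the feedback with the n-th captured subpattern."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # unknown subpattern: leave as is
        return match.group(index) or ""  # unmatched subpattern: empty string
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_placeholders(FEEDBACK, NUMBER.search("123.34")))
print(fill_placeholders(FEEDBACK, NUMBER.search(".5")))
```

The second call shows the empty integral part: "Integral part is" stays in the text with nothing after it.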

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
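To get a feel for parser-level error reporting, the sketch below feeds a few of the listed patterns to Python's own regex parser, which also rejects them with a descriptive message. This is an analogy only: the messages come from Python's <code>re</code> module, not from the Preg parser, and the set of patterns each parser rejects differs. <code>describe_error</code> is a name chosen for this example.

```python
import re

def describe_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# A few patterns from the list above that Python's parser also rejects.
samples = ["abc)", "(?:qwerty", "+", "[a-fA-F\\d", "a{5,4}", "[z-a]"]

for pattern in samples:
    print(repr(pattern), "->", describe_error(pattern))
```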

PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you are for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
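The missing-word regex can be tried directly, here with Python's <code>re</code> module as a stand-in for the PHP preg engine (the pattern behaves the same way in both). The function name <code>word_is_missing</code> and the sample responses are illustrative.

```python
import re

# The regex from the text: succeeds only when "necessary" appears nowhere.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True when the response does not contain the word "necessary"."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("this step is optional"))
print(word_is_missing("this step is necessary"))
print(word_is_missing("this step is unnecessary"))  # \b keeps "unnecessary" from counting
```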

You could also have a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that has neither "I" before it (the "(?<!I\s)" part) nor "victor" after it (the "(?!\s+victor)" part). "\s+" in the lookahead allows any number of spaces between the words; a lookbehind assertion in PCRE must have a fixed length, so it uses a single "\s". If you want to catch the first (last) word (punctuation mark, etc.), you should place the simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
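A quick Python check of the misplaced-word idea follows. It writes the negative lookbehind in the "(?<!...)" form with a single "\s", because Python's <code>re</code> (like PCRE) requires lookbehind assertions of fixed width; as a consequence, two spaces between "I" and "am" would already be reported as misplaced. The constant and function names are chosen for this example only.

```python
import re

# "am" with no "I " right before it and no "victor" after it.
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is not at the very start and is not followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def is_misplaced(pattern, response):
    """True when the pattern finds its word in a wrong position."""
    return pattern.search(response) is not None

print(is_misplaced(MISPLACED_AM, "I am victor"))   # correct order, nothing caught
print(is_misplaced(MISPLACED_AM, "am I victor"))
print(is_misplaced(MISPLACED_I, "am I victor"))
```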

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

For now, the NFA and DFA matchers don't support the complex assertions used by these regexes. Since Preg 2.3 you can nevertheless combine hinting with catching missing words. All you need to ensure is that the answers that look for missing things (and other answers that exist only to give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting|hinting]]). You don't want to generate hints from these answers anyway, as they don't describe a correct response, so this is not really a restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all regex operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions in the main (hinting) regular expressions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper that describes your usage of the Preg question type, which I could cite, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help, and I would be better able to determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T19:03:29Z

Oasychev: /* Looking for missing things */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions, and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy, if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there.
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches at the start of the string, fails otherwise;
#* "$" matches at the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter.
# '''POSIX character classes''' are used for the same purpose as unicode properties (and complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character sets:
#* <nowiki>"[[:alnum:]]"</nowiki> matches any alpha-numeric character;
#* <nowiki>"[[:^digit:]]"</nowiki> matches any non-digit character.
All characters and escape sequences work both inside and outside character sets, but POSIX classes are allowed only inside character sets [...]. Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.
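Several of the operands above can be checked with Python's <code>re</code> module, used here as a rough stand-in for PCRE. Two differences are worth stating as assumptions of this sketch: Python has no "\Q ... \E" sequence, so <code>re.escape</code> is used to get the same literal treatment, and Unicode properties and POSIX classes are left out because the standard <code>re</code> module does not support them. The helper name <code>full</code> is made up.

```python
import re

def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# Escaped special characters match themselves.
print(full(r"a\*b\[", "a*b["))
# re.escape plays the role of \Q...\E: everything inside is literal.
print(full(re.escape("^(abc)$") + ".", "^(abc)$x"))
# Character sets: ranges, a single character, and a negative set.
print(full(r"[a-szC-F0-9]", "z"), full(r"[a-szC-F0-9]", "t"))
print(full(r"[^a-z-]", "A"), full(r"[^a-z-]", "-"))
# \b is a condition on the position, not a character.
print(re.search(r"\bcat\b", "a cat sat") is not None)
print(re.search(r"\bcat\b", "concatenate") is not None)
```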

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what that means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
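The precedence rules above are easy to verify by testing whole strings against each regex. The following Python sketch does that with <code>re.fullmatch</code>; the helper name <code>full</code> is chosen for the example.

```python
import re

def full(pattern, text):
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: "*" repeats only "b" unless parenthesized.
print(full("ab*", "abbb"), full("ab*", "abab"), full("(ab)*", "abab"))
# Quantifier over alternative: "*" binds to "d", not to the alternatives.
print(full("ab|cd*", "cddd"), full("(ab|cd)*", "ababcdabcdcd"))
# A repeated alternative is not the same as an alternative of repetitions.
print(full("(a|b){2}", "ab"), full("a{2}|b{2}", "ab"))
```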

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
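Both backreference examples can be run as written. The sketch below uses Python's <code>re</code> module, whose numbered backreferences behave like PCRE's for these patterns; the constant names are illustrative.

```python
import re

# The whitespace-tolerant declaration regex from the text.
DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)
# A backreference must repeat exactly what the subpattern captured.
DOUBLED = re.compile(r"([ab])\1")

m = DECLARATION.fullmatch(" int var2 ; var2=123 ; ")
print(m.group(1))                                  # name captured by subpattern 1
print(DECLARATION.fullmatch("int a; b = 1;"))      # different names: no match
print(DOUBLED.fullmatch("aa") is not None, DOUBLED.fullmatch("ab") is not None)
```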

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
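As a small illustration of named subpatterns, here is a regex that finds a repeated word. It is written for Python's <code>re</code> module, which accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above; the other spellings are PCRE-specific. The name <code>REPEATED_WORD</code> and the sample sentences are made up.

```python
import re

# "word" names the subpattern; (?P=word) must repeat the captured text.
REPEATED_WORD = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")

m = REPEATED_WORD.search("this is is wrong")
print(m.group("word"))   # access by name
print(m.group(1))        # the same subpattern is still number 1
print(REPEATED_WORD.search("no repeats here"))
```

Adding another pair of parentheses before the named one would change its number, but <code>group("word")</code> would keep working.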

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are the 1st ones, "b" and "d" are the 2nd ones.

====Assertions====
Assertions about some part of the string don't become part of the matched text, but affect whether and where a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" using brackets to match exactly with "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
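
The difference between the three anchored forms is easy to check by hand. The sketch below assumes Python's <code>re</code> module behaves like PCRE for these simple patterns; the helper <code>responses_matching</code> and the sample responses are made up for illustration.

```python
import re

def responses_matching(pattern, responses):
    """Return the responses in which the pattern finds a match."""
    return [r for r in responses if re.search(pattern, r)]

responses = ["ab", "cd", "abX", "Xcd", "XabX"]

# Anchors bind to the nearest alternative only.
print(responses_matching(r"^ab|cd$", responses))
# Parentheses make both anchors apply to the whole alternative.
print(responses_matching(r"^(ab|cd)$", responses))
# Anchoring every top-level alternative gives the same result.
print(responses_matching(r"^ab$|^cd$", responses))
```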

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen on the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match, but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers are written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type simply as an improved shortanswer, with access to hinting, without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and at which character the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When only partial matching is used, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button; pressing it shows a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.
Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if question engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
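
The choice between ',' and ' ' can be reproduced with a toy model. This is a deliberately naive sketch and not how the NFA/DFA engines work internally: it assumes the regex is anchored at the start and simply lists the two strings that '''are blue, white(,| and) red''' can match; the function name <code>next_hint</code> is hypothetical.

```python
def next_hint(response, accepted_strings):
    """Return the next character on the shortest path to a full match.

    For every accepted string, find how far the response agrees with it,
    then pick the completion with the fewest remaining characters.
    Returns None when the response already equals or extends a target.
    """
    best_remaining = None
    for target in accepted_strings:
        n = 0  # length of the common prefix of response and target
        while n < len(response) and n < len(target) and response[n] == target[n]:
            n += 1
        remaining = target[n:]
        if best_remaining is None or len(remaining) < len(best_remaining):
            best_remaining = remaining
    if not best_remaining:
        return None
    return best_remaining[0]

# The two strings matched by: are blue, white(,| and) red
accepted = ["are blue, white, red", "are blue, white and red"]
print(next_hint("are blue, white; red", accepted))
```

Both candidates agree with the response up to "are blue, white"; the completion ", red" needs 5 characters and " and red" needs 8, so the comma is hinted.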

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
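
The effect of the hint grade border amounts to a simple filter over the answers. The sketch below is illustrative only; the function <code>answers_used_for_hinting</code> and the data layout are assumptions, not the question type's real API.

```python
def answers_used_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose fraction reaches the hint grade border."""
    return [pattern for pattern, fraction in answers
            if fraction >= hint_grade_border]

# The two expressions from the examples above with their grades.
answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]
print(answers_used_for_hinting(answers, 1))    # only the 100% regex
print(answers_used_for_hinting(answers, 0.5))  # both regexes
```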

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to remove "Integral part is" in that case - the placeholder syntax would become complex and prone to errors.
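
Placeholder substitution of this kind can be sketched in a few lines. This is an illustration under assumptions, not the question type's implementation: it uses Python's <code>re</code> module, and the function <code>fill_feedback</code> is a made-up name.

```python
import re

# Matches placeholders of the form {$0}, {$1}, {$2}, ...
PLACEHOLDER = re.compile(r"\{\$(\d+)\}")

def fill_feedback(pattern, response, feedback):
    """Replace {$n} placeholders in feedback with captured subpatterns.

    Returns None if the pattern does not match the response. Subpatterns
    that did not take part in the match, or do not exist, become "".
    """
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return ""  # assumption: unknown subpattern numbers become empty
        return match.group(index) or ""

    return PLACEHOLDER.sub(substitute, feedback)

regex = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(regex, "123.34", feedback))
print(fill_feedback(regex, ".5", feedback))
```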

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and this is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing and misplaced things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which lets you define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regex that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
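
The missing-word regex can be checked quickly. The sketch assumes Python's <code>re</code> module as a stand-in for PCRE; <code>word_is_missing</code> is a hypothetical helper name.

```python
import re

# Matches (at the start of the response) only if "necessary" never occurs
# as a whole word anywhere ahead.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """Return True if the word "necessary" is absent from the response."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("it is necessary to test"))  # word present
print(word_is_missing("it is needed"))             # word absent
# "unnecessary" does not count: there is no word boundary before "necessary".
print(word_is_missing("an unnecessary step"))
```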

You could also do a rough search for '''misplaced words''' (it will actually work only if everything else is correct) using syntax like this:
(?<!I\s)\bam\b(?!\s+victor)
This expression catches a misplaced "am" in the sentence "I am victor", looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part) and "victor" after it (the "(?!\s+victor)" part). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because PCRE lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like
(?<!^)\bI\b(?!\s+am)
which looks for an "I" that is not preceded by the start of the string and not followed by "am".
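
A quick check of the misplaced-word idea, as a sketch only. It uses Python's <code>re</code> module, writes the "not preceded by" conditions with the negative lookbehind syntax "(?<!...)" and allows just one whitespace character inside the lookbehind, because Python (like PCRE) requires lookbehind to have a fixed width. The function names are made up for illustration.

```python
import re

# "am" that is neither right after "I " nor right before " victor".
MISPLACED_AM = re.compile(r"(?<!I\s)\bam\b(?!\s+victor)")
# "I" that is neither at the start of the string nor followed by "am".
MISPLACED_I = re.compile(r"(?<!^)\bI\b(?!\s+am)")

def am_is_misplaced(response):
    return MISPLACED_AM.search(response) is not None

def i_is_misplaced(response):
    return MISPLACED_I.search(response) is not None

print(am_is_misplaced("I am victor"))   # correct sentence
print(am_is_misplaced("am I victor"))   # "am" moved to the front
print(i_is_misplaced("I am victor"))    # correct sentence
print(i_is_misplaced("am I victor"))    # "I" moved to the middle
```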

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

NFA and DFA matchers don't support the complex assertions used by these regexes for now. Since Preg 2.3 you can combine hints and catching missing words. All you need to ensure is that the answers that look for missing things (and others that give specific feedback) have a '''fraction''' (grade) lower than the '''hint grade border''' (see [[#Hinting]]). You don't actually want to generate hints from these answers, as they don't describe a correct situation, so it's not really a restriction.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions it should handle. Engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of the PHP preg functions are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T18:43:11Z

Oasychev: /* Regular expressions */

The Preg question type is a question type that uses regular expressions (regexes) to check students' responses. Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. This documentation contains a part about expressions in general, a part about regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip the first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
The regular expressions - as any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
#*'''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
#* If you have too many characters that need escaping in some fragment, you could use the '''\Q ... \E''' sequence instead. Anything between \Q and \E is treated literally as characters:
#** "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#** "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter;
Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.
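
Before moving on, the character-set operands listed above can be tried out one character at a time. This is a sketch using Python's <code>re</code> module as a stand-in for PCRE (an assumption; Python's standard module does not support "\Q...\E" or "\p{xx}", so those are left out). The helper <code>accepted_chars</code> is a made-up name.

```python
import re

def accepted_chars(char_set, candidates):
    """Return the candidate characters that the one-character regex matches."""
    return [c for c in candidates if re.fullmatch(char_set, c)]

# Plain set: every listed character matches itself.
print(accepted_chars(r"[ab,!]", ["a", "b", ",", "!", "c"]))
# Ranges a-s, C-F, 0-9 mixed with the single character "z".
print(accepted_chars(r"[a-szC-F0-9]", ["a", "s", "t", "z", "C", "G", "5"]))
# Negative set: anything except a-z and the hyphen.
print(accepted_chars(r"[^a-z-]", ["a", "-", "A", "1"]))
# Escaping inside a set: hyphen, closing bracket and backslash.
print(accepted_chars(r"[\-\]\\]", ["-", "]", "\\", "a"]))
# Escape sequences for common sets.
print(accepted_chars(r"\w", ["x", "_", "7", " ", "!"]))
print(accepted_chars(r"\s", ["x", " ", "\t"]))
```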

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
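
The precedence rules above can be verified by asking whether a regex matches a whole string. A sketch using Python's <code>re.fullmatch</code> (assumed to agree with PCRE for these basic operators); <code>matches_whole</code> is a hypothetical helper.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: "*" repeats only "b".
print(matches_whole(r"ab*", "abbb"), matches_whole(r"ab*", "abab"))
# Parentheses make the quantifier repeat the whole string "ab".
print(matches_whole(r"(ab)*", "abab"), matches_whole(r"(ab)*", ""))
# Quantifier over alternative: "*" repeats only "d".
print(matches_whole(r"ab|cd*", "cddd"), matches_whole(r"ab|cd*", "abd"))
# Repeating the whole alternative set, in any order.
print(matches_whole(r"(ab|cd)*", "ababcdabcdcd"))
# A repeated alternative is not a repetition of one selection.
print(matches_whole(r"(a|b){2}", "ab"), matches_whole(r"a{2}|b{2}", "ab"))
```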

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a small example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
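
The backreference examples can be run as they are. The sketch below uses Python's <code>re</code> module, which handles numbered backreferences the same way for these patterns (an assumption about equivalence with PCRE); the function <code>declared_variable</code> is made up for illustration.

```python
import re

# The whitespace-tolerant regex from the text above.
C_DECLARATION = re.compile(
    r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*"
)

def declared_variable(code):
    """Return the variable name if code declares and then initializes
    the same int variable, otherwise None."""
    match = C_DECLARATION.fullmatch(code)
    return match.group(1) if match else None

print(declared_variable(" int var2 ; var2=123 ; "))
print(declared_variable("int _var; _var = -10;"))
# A different name in the assignment breaks the backreference.
print(declared_variable("int a; b = 1;"))

# The simplest backreference: the same character twice.
print(bool(re.fullmatch(r"([ab])\1", "aa")), bool(re.fullmatch(r"([ab])\1", "ab")))
```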

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
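
Named subpatterns can be illustrated with the one syntax pair that Python's <code>re</code> module shares with PCRE: "(?P<name>...)" for the subpattern and "(?P=name)" for the backreference. This is a sketch under that assumption; the other spellings listed above are PCRE-specific, and the names <code>varname</code> and <code>declared_name</code> are invented for the example.

```python
import re

# The C declaration regex with a named subpattern and a named backreference.
NAMED_DECLARATION = re.compile(
    r"\s*int\s+(?P<varname>[_\w][_\w\d]*)\s*;\s*(?P=varname)\s*=\s*-?\d+\s*;\s*"
)

def declared_name(code):
    """Return the captured variable name, or None if there is no match."""
    match = NAMED_DECLARATION.fullmatch(code)
    return match.group("varname") if match else None

print(declared_name("int counter; counter = 0;"))
print(declared_name("int counter; total = 0;"))
```

Adding another pair of parentheses before the named subpattern would change its number, but the name keeps working.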

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will use the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are both 1st, "b" and "d" are both 2nd.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question tries all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and gives the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" using brackets to match exactly with "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
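
The scoping rule can be tried out in Python, whose re module stands in for PCRE here (an assumption of this sketch; it is not the engine Preg uses). Recent Python versions reject "(?i)" in the middle of a pattern, so the sketch uses the scoped form "(?i:...)", which marks the same case-insensitive part explicitly:

```python
import re

# PCRE "abc(de(?i)gh)xyz": only "gh" is case-insensitive.
inside_group = re.compile(r"abc(de(?i:gh))xyz")
# PCRE "abc(de)(?i)ghxyz": "ghxyz" is case-insensitive.
top_level = re.compile(r"abc(de)(?i:ghxyz)")


def full(pattern, text):
    """Return True when the whole text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None
```

With these patterns "abcdeGHxyz" is accepted by the first regex, while "abcdeghXYZ" is not, because "xyz" lies outside the parentheses that limit the modifier.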

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, with access to hinting and without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and copy the answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the settings that configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. With partial matching alone, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student will have a '''hint''' button; pressing it shows a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type), showing them separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response;
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the matching engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
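
A minimal Python sketch of the choice described above. It is not the plugin's algorithm (the real engines walk a finite automaton); it assumes the regex has been expanded by hand into the list of strings it accepts, and the names next_hint and CANDIDATES are hypothetical:

```python
# The two strings accepted by "are blue, white(,| and) red", expanded by hand.
CANDIDATES = ["are blue, white, red", "are blue, white and red"]


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def next_hint(candidates, response):
    """Return (matched_part, hint_char) for the partial match that needs
    the fewest extra characters to become a full match, or None."""
    best = None
    for start in range(len(response)):  # unanchored: try every start position
        for cand in candidates:
            n = common_prefix_len(cand, response[start:])
            if n == 0 or n == len(cand):
                continue  # nothing matched here, or it is already a full match
            remaining = len(cand) - n
            if best is None or remaining < best[0]:
                best = (remaining, response[start:start + n], cand[n])
    return None if best is None else (best[1], best[2])
```

For the response "are blue, white; red" the sketch picks ',' because the candidate ending in ", red" needs 5 more characters, while the one ending in " and red" needs 8.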

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for the students with bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
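
The filtering rule can be sketched in a few lines of Python. The ANSWERS list and the function name are made up for illustration, and grades are written as fractions of 1:

```python
# (regex, grade) pairs as a teacher might enter them.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]


def regexes_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]
```

With a border of 1 only the first regex is used for hinting; with a border of 0.5 both are.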

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching, the engine remembers ('''captures''') not only the whole match but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
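
The substitution can be imitated in Python to see what the student gets. This is only a sketch: the function fill_feedback is hypothetical and Python's re module is used instead of the Preg engines:

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"


def fill_feedback(template, match):
    """Replace {$N} placeholders with the text captured by subpattern N."""
    def replace(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # no such subpattern: leave as is
        return match.group(index) or ""  # unmatched subpattern gives ""
    return re.sub(r"\{\$(\d+)\}", replace, template)
```

Calling fill_feedback(FEEDBACK, NUMBER.search("123.34")) gives the feedback line shown above; for ".5" the {$1} placeholder becomes an empty string.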

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
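
To get a feeling for what a reporting parser looks like, here is a small Python sketch. Python's re module is a different parser from the one used by the NFA and DFA engines, so the messages and the exact set of reported errors differ; the helper regex_error is hypothetical:

```python
import re


def regex_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# Some of the erroneous patterns from the list above.
BROKEN = [r"abc)", r"(?:qwerty", r"+", r"a(+)", r"[a-fA-F\d", r"a{5,4}", r"[z-a]"]
```

Each pattern in BROKEN yields a message instead of being silently treated as a literal, while a valid pattern such as "a{4,5}" yields None.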

PCRE (and the preg functions) treats most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you are for or against this decision, please write your position and reasons in the comments. It might be best to have two modes, but that literally means two parsers and is outside the current scope of development. There are more pressing issues ahead.

====Looking for missing things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).
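
The effect of the anchor is easy to check in Python (re is used here in place of the PHP preg engine; the helper name is hypothetical):

```python
import re

MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")
# The same assertion without the anchor, to show why "^" is needed.
UNANCHORED = re.compile(r"(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True when the word "necessary" does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None
```

The anchored regex matches "it is optional" but not "it is necessary here". The unanchored one matches both, because a match can start at a position after the beginning of the word.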

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it requires some complex work.

====Matching engines====
A matching engine is the program code that performs the matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine has to handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all regex operands and operators are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I might be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type that I could refer to would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T18:36:36Z

Oasychev: /* Subpatterns and backreferences */

The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about the regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
The regular expressions - as any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\". '''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape-sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter;
Still, a pattern that matches only one character isn't very useful. So here comes the '''operators''' that allow us to define an expression which matches strings of several characters.
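
The operands listed above can be tried out with a few lines of Python (the re module agrees with PCRE on these constructs; the helper matches_all and the EXAMPLES table are only an illustration):

```python
import re


def matches_all(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result)
EXAMPLES = [
    (r"[ab,!]", "!", True),
    (r"[a-szC-F0-9]", "z", True),
    (r"[a-szC-F0-9]", "t", False),  # "t" lies between "s" and "z"
    (r"[^a-z-]", "-", False),       # the hyphen is excluded too
    (r"[^a-z-]", "A", True),
    (r"[\-\]\\]", "]", True),
    (r"a\*b\[", "a*b[", True),
    (r"\.", "x", False),            # an escaped dot matches only "."
]

for pattern, text, expected in EXAMPLES:
    assert matches_all(pattern, text) == expected
```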

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - ''unary'' operators that let you define repetition of something used as their operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
# The '''\Q...\E''' sequence (recognized both inside and outside character sets) is used for quoting substrings. Characters in between are treated as literals:
#* "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#* "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
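
A short Python check of some of the operators above. Python's re module has no "\Q...\E" sequence, so the sketch uses re.escape for quoting instead, which is a substitution and not the PCRE syntax itself; the helper accepts is hypothetical:

```python
import re


def accepts(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Alternative with an empty branch: "aac", "bbc" or just "c".
assert accepts(r"(aa|bb|)c", "c")
assert not accepts(r"(aa|bb|)c", "abc")

# Quantifier with limits: from 2 to 4 repetitions of "x".
assert not accepts(r"x{2,4}", "x")
assert accepts(r"x{2,4}", "xxxx")
assert not accepts(r"x{2,4}", "xxxxx")

# Quoting a substring: every character of "^(abc)$" is taken literally.
quoted = re.escape("^(abc)$") + "."
assert accepts(quoted, "^(abc)$x")
```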

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's see what that means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
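
The precedence rules are easy to verify experimentally. A small Python sketch (re behaves like PCRE here; the helper name accepts is hypothetical):

```python
import re


def accepts(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
assert accepts(r"ab*", "abbb") and not accepts(r"ab*", "abab")
assert accepts(r"(ab)*", "abab")

# Quantifier over alternative: "*" applies to "d" only.
assert accepts(r"ab|cd*", "cddd") and not accepts(r"ab|cd*", "abd")
assert accepts(r"(ab|cd)*", "ababcdabcdcd")

# Repeating an alternative is not the same as alternating repetitions.
assert accepts(r"(a|b){2}", "ab")
assert not accepts(r"a{2}|b{2}", "ab")
```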

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numerated from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". Looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
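
Both regexes from this section can be tested in Python, where numbered backreferences work the same way (the helper declared_name is hypothetical):

```python
import re

# The backreference must repeat exactly what the 1st subpattern captured.
DOUBLED = re.compile(r"([ab])\1")

# Declaration and initialization of an integer variable, spaces allowed.
C_INIT = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")


def declared_name(code):
    """Return the variable name if the whole code matches C_INIT, else None."""
    match = C_INIT.fullmatch(code)
    return match.group(1) if match else None
```

declared_name(" int var2 ; var2=123 ; ") returns "var2", while "int a; b = 1;" is rejected because the initialized name differs from the declared one.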

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
# "(?<name1>...)" means a subpattern with name "name1";
# "(?'name2'...)" means a subpattern with name "name2";
# "(?P<name3>...)" means a subpattern with name "name3";
# "\k<name4>" means a backreference to the subpattern named "name4";
# "\k'name5'" means a backreference to the subpattern named "name5";
# "\g{name6}" means a backreference to the subpattern named "name6";
# "\k{name7}" means a backreference to the subpattern named "name7";
# "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify it by adding or removing subpatterns - names stay the same.
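
A quick Python illustration of named subpatterns. Of the syntaxes listed above, Python's re module accepts only "(?P<name>...)" and "(?P=name)", so the sketch is limited to those two; the names letter and doubled_letter are made up:

```python
import re

# Same check as "([ab])\1", but the subpattern is referred to by name.
NAMED = re.compile(r"(?P<letter>[ab])(?P=letter)")


def doubled_letter(text):
    """Return the repeated letter if the text is "aa" or "bb", else None."""
    match = NAMED.fullmatch(text)
    return match.group("letter") if match else None
```

Adding another pair of parentheses before the named subpattern would change its number, but not its name.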

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions check some part of the string without including it in the matched text, but they affect where a match can occur:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
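
The four assertions can be compared side by side in Python, where they behave as in PCRE (the helper first_match is hypothetical):

```python
import re


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


assert first_match(r"a+(?=b)", "aaab") == "aaa"   # "b" is checked, not consumed
assert first_match(r"a+(?!b)", "aaab") == "aa"    # backs off from the "a" before "b"
assert first_match(r"(?<=b)a+", "baaa") == "aaa"
assert first_match(r"(?<!b)a+", "baaa") == "aa"   # skips the "a" right after "b"
```

Note the two negative cases: the match is shortened or shifted so that the assertion holds, it does not simply fail.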

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question: "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question tries the regular expressions in order, takes the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and gives the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.
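
The "first full match wins" rule can be sketched in Python. This is an illustration under assumptions, not the question type's code: the function grade_response is hypothetical, grades are fractions of 1, and Python string patterns are Unicode-aware by default, which stands in for the "u" modifier:

```python
import re


def grade_response(answers, response, case_sensitive=False):
    """Try (regex, grade) pairs in order; return (grade, matched_text) for the
    first regex found in the response, or (0.0, None) when none matches."""
    flags = 0 if case_sensitive else re.IGNORECASE  # the "i" modifier
    for regex, grade in answers:
        match = re.search(regex, response, flags)
        if match:
            return grade, match.group(0)
    return 0.0, None


ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]
```

For example, grade_response(ANSWERS, "they are white and red") returns (0.6, "are white and red"): the match is full for the second expression although it does not cover the whole response.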

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
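
The three anchoring variants can be compared in Python, where "^", "$" and "|" interact in the same way (the helper found is hypothetical):

```python
import re


def found(pattern, response):
    """Return the matched part of the response, or None if nothing matches."""
    match = re.search(pattern, response)
    return match.group(0) if match else None


# Each anchor belongs to one alternative only.
assert found(r"^ab|cd$", "abxyz") == "ab"
assert found(r"^ab|cd$", "xyzcd") == "cd"

# Parentheses make both anchors apply to the whole alternative.
assert found(r"^(ab|cd)$", "abxyz") is None
assert found(r"^(ab|cd)$", "cd") == "cd"

# Anchoring every top-level alternative gives the same exact matching.
assert found(r"^ab$|^cd$", "xyzcd") is None
```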

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers will be written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, with access to hinting and without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and copy the answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the settings that configure your question's hinting behaviour.
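
To see how a shortanswer-style answer relates to a regex, here is a rough Python sketch. It is an assumption about the general idea, not the question type's converter: the function shortanswer_to_regex is hypothetical, it anchors the result on both sides, and it ignores escaped asterisks:

```python
import re


def shortanswer_to_regex(answer):
    """Translate a shortanswer-style answer into an anchored regex:
    '*' becomes '.*', every other character is matched literally."""
    parts = [re.escape(part) for part in answer.split("*")]
    return "^" + ".*".join(parts) + "$"
```

For instance, the answer "The * flag" accepts "The French flag", and "1+1=2" is matched literally although "+" is a special character in regexes.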

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not begin at the start of the student's response. With partial matching alone, the student is shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button; pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type); it shows them separately, for two reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the matching engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.
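The choice described above can be sketched in Python. This is only an illustration, not the Preg implementation: it assumes the finite language of the expression has been expanded by hand into the hypothetical list <code>COMPLETIONS</code>, and <code>next_hint</code> is an invented name.

```python
# Hand-expanded language of: are blue, white(,| and) red
# (an assumption of this sketch; a real engine works on an automaton instead)
COMPLETIONS = ["are blue, white, red", "are blue, white and red"]


def next_hint(completions, response):
    """Return (partial_match, hint_char) for the partial match that is
    shortest to complete, or None if nothing matches partially."""
    best = None  # (characters left to complete, matched text, next character)
    for start in range(len(response)):
        for full in completions:
            n = 0
            while (n < len(full) and start + n < len(response)
                   and response[start + n] == full[n]):
                n += 1
            if n == 0 or n == len(full):
                continue  # no partial match here, or the match is already full
            candidate = (len(full) - n, full[:n], full[n])
            if best is None or candidate[0] < best[0]:
                best = candidate
    return None if best is None else (best[1], best[2])


print(next_hint(COMPLETIONS, "they are blue, vhite and red"))
print(next_hint(COMPLETIONS, "are blue, white; red"))
```

For the second response the sketch picks ',' because 5 characters remain after "are blue, white" on that path, against 8 on the path through ' '.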

Not all regular expressions need to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
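As a rough Python sketch of this rule (the function name and the list-of-pairs layout are assumptions made for illustration, not the question type's real data structures):

```python
def hinting_answers(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]


answers = [
    ("are blue, white(,| and) red", 1.0),  # full-credit answer
    ("are white(,| and) red", 0.6),        # answer for students who forgot blue
]

print(hinting_answers(answers, 1.0))  # only the 100% regex is used for hinting
print(hinting_answers(answers, 0.5))  # both regexes are used for hinting
```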

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match but also the parts corresponding to each subpattern. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all the repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses: '''0''' is the whole regex, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'', because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} is replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances: a placeholder syntax for that could become complex and error-prone.
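The substitution can be imitated in a few lines of Python. This is a sketch of the idea only; <code>fill_feedback</code> is a hypothetical helper, and the real question type does this in PHP.

```python
import re


def fill_feedback(pattern, feedback, response):
    """Replace {$n} placeholders in feedback with the n-th captured subpattern."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # no such subpattern: leave text as is
        return match.group(index) or ""  # an unmatched subpattern becomes empty

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(NUMBER, FEEDBACK, "123.34"))
print(fill_feedback(NUMBER, FEEDBACK, ".5"))
```

The second call shows the empty-string case: "Integral part is" stays in the text with nothing after it.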

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
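To get a feeling for parser-level error reporting, here is a small Python sketch. It uses Python's own <code>re</code> parser, which is neither PCRE nor the Preg parser, so the set of rejected constructs and the wording of the messages are specific to Python; only the constructs that Python is known to reject are listed.

```python
import re


def check(pattern):
    """Return None when the pattern compiles, otherwise the parser's message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


# Examples taken from the list above that Python's parser also reports.
for pattern in ["abc)", "(?:qwerty", "+", "a(+)", "[a-fA-F\\d", "a{5,4}", "[z-a]"]:
    print(repr(pattern), "->", check(pattern))
```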

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, then please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

====Looking for missing things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, allowing to define an answer that will work when something is absent in the answer (and give an appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
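The same expression behaves identically in Python's <code>re</code> module, so it can be tried out quickly there. A minimal sketch, with <code>word_is_missing</code> as an invented helper name:

```python
import re

# Matches (with an empty match at the start) only if "necessary" is absent.
MISSING = re.compile(r"^(?!.*\bnecessary\b.*)")


def word_is_missing(response):
    """True when the whole word 'necessary' does not occur in the response."""
    return MISSING.search(response) is not None


print(word_is_missing("it is necessary to check"))  # False
print(word_is_missing("it is needed to check"))     # True
print(word_is_missing("unnecessary work"))          # True: not a whole word
```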

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it requires some complex work.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine: the choice depends on the features you want to use and the regular expressions the engine has to handle. The engines differ in stability and in the features they offer.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But the PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP regular expressions are supported, and for some (more exotic) ones the behaviour can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one, and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question I could give reference for would improve rating of the project there and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T18:36:02Z

Oasychev: /* Subpatterns and backreferences */

The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about the regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), taking heed of their evaluation order by using their '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches the single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands, i.e. preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\". '''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
# '''Unicode properties''' are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific unicode characters (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
#* "\p{Ll}" matches any lowercase letter;
#* "\P{Lu}" matches any non-uppercase letter;
Still, a pattern that matches only one character isn't very useful. So here comes the '''operators''' that allow us to define an expression which matches strings of several characters.
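Several of the operands above can be tried out in Python, whose <code>re</code> module agrees with PCRE on these particular constructs. The helper <code>matches</code> is an invented name; it checks that the whole string is matched.

```python
import re


def matches(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None


print(matches(r"a\*b\[", "a*b["))       # escaped special characters
print(matches(r"[a-szC-F0-9]", "z"))    # single character inside a set
print(matches(r"[a-szC-F0-9]", "t"))    # "t" lies outside the range "a-s"
print(matches(r"[^a-z-]", "-"))         # negative set excludes the hyphen
print(matches(r"[\-\]\\]", "]"))        # escaping inside a character set
print(matches(r"\d\D\w", "7-x"))        # escape sequences for common sets
# "\b" is a condition, not a character: "cat" must stand as a whole word.
print(re.search(r"\bcat\b", "a cat sat") is not None)
print(re.search(r"\bcat\b", "concatenate") is not None)
```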

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - such a simple ''binary'' operator that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5"
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - an ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
# The '''\Q...\E''' sequence (recognized both inside and outside character sets) is used for quoting substrings. Characters in between are treated as literals:
#* "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#* "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.

====Precedence and order of evaluation====
A '''Quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
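The precedence rules can be checked with a short Python sketch (Python's <code>re</code> follows the same precedence; <code>matches</code> is a hypothetical helper that requires the whole string to match):

```python
import re


def matches(regex, text):
    """True if the regex matches the whole text."""
    return re.fullmatch(regex, text) is not None


# Quantifier over concatenation: "*" repeats only "b" unless parentheses are used.
print(matches(r"ab*", "abbb"), matches(r"ab*", "abab"))
print(matches(r"(ab)*", "abab"))

# Quantifier over alternative: "*" belongs to "d" only.
print(matches(r"ab|cd*", "cddd"), matches(r"ab|cd*", "abab"))
print(matches(r"(ab|cd)*", "ababcdabcdcd"))

# A repeated alternative is not a repetition of one selected branch.
print(matches(r"(a|b){2}", "ab"), matches(r"a{2}|b{2}", "ab"))
```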

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that ''remember'' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: the declaration and initialization of an integer variable in the C programming language:
#* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
#* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
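The second regex can be tested as-is in Python, where "\1" has the same meaning. A small sketch; <code>DECLARATION</code> and <code>is_declaration</code> are names invented for this example:

```python
import re

DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")


def is_declaration(text):
    """True if text declares an int variable and then initializes the same one."""
    return DECLARATION.fullmatch(text) is not None


print(is_declaration(" int var2 ; var2=123 ; "))  # same name twice
print(is_declaration("int _var; _var = -10;"))
print(is_declaration("int a; b = 1;"))            # "\1" must repeat "a", not "b"
```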

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
#* "(?<name1>...)" means a subpattern with name "name1";
#* "(?'name2'...)" means a subpattern with name "name2";
#* "(?P<name3>...)" means a subpattern with name "name3";
#* "\k<name4>" means a backreference to the subpattern named "name4";
#* "\k'name5'" means a backreference to the subpattern named "name5";
#* "\g{name6}" means a backreference to the subpattern named "name6";
#* "\k{name7}" means a backreference to the subpattern named "name7";
#* "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
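A quick way to experiment with named subpatterns is Python, with the caveat that its <code>re</code> module accepts only the "(?P<name>...)" and "(?P=name)" forms from the list above. A sketch reusing the C declaration example; the names are invented for illustration:

```python
import re

# Same regex as the C declaration example, with the variable name captured by name.
DECLARATION = re.compile(
    r"\s*int\s+(?P<name>[_\w][_\w\d]*)\s*;\s*(?P=name)\s*=\s*-?\d+\s*;\s*"
)


def declared_name(text):
    """Return the declared variable name, or None if the text does not match."""
    match = DECLARATION.fullmatch(text)
    return match.group("name") if match else None


print(declared_name("int _var; _var = -10;"))
print(declared_name("int a; b = 1;"))
```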

=====Duplicate subpattern numbers and names=====
There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into matching text, but affect the matching occurrence:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
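The four assertions can be tried in Python, which uses the same syntax for them. In this sketch <code>first_match</code> is an invented helper that returns the text of the first match found anywhere in the string:

```python
import re


def first_match(regex, text):
    """Return the text of the first match, or None if there is no match."""
    match = re.search(regex, text)
    return match.group() if match else None


print(first_match(r"a+(?=b)", "caab"))   # "aa": the "b" is checked but not consumed
print(first_match(r"a+(?!b)", "aab"))    # "a": the engine gives back one "a"
print(first_match(r"(?<=b)a+", "baa"))   # "aa": preceded by "b"
print(first_match(r"(?<!b)a+", "baa"))   # "a": only the second "a" qualifies
```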

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is always added and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will use all regular expressions in order to find the first full match (full for the expression, but not necessarily the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.
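The "first full match gives the grade" rule can be sketched in Python as follows. This is an illustration under assumptions, not the question type's code: <code>grade_response</code> and the list of (regex, grade) pairs are invented, and partial matching is left out.

```python
import re


def grade_response(answers, response, case_sensitive=True):
    """Return the grade of the first answer regex that fully matches
    somewhere in the response, or 0.0 when none does."""
    flags = 0 if case_sensitive else re.IGNORECASE  # plays the role of "i"
    for regex, grade in answers:
        if re.search(regex, response, flags):
            return grade
    return 0.0


answers = [
    ("^are blue, white(,| and) red$", 1.0),  # manually anchored, full credit
    ("are white(,| and) red", 0.6),          # unanchored, catches a common error
]

print(grade_response(answers, "are blue, white and red"))
print(grade_response(answers, "they are white, red"))
print(grade_response(answers, "no idea"))
```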

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" from the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
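The difference between these three expressions shows up as soon as the response contains extra text. A Python sketch (the anchors behave the same way there for single-line strings; <code>found</code> is an invented helper):

```python
import re


def found(regex, response):
    """True if the regex matches anywhere in the response."""
    return re.search(regex, response) is not None


# "^" binds only to "ab" and "$" only to "cd".
print(found(r"^ab|cd$", "abxyz"), found(r"^ab|cd$", "xyzcd"))
# With parentheses both anchors apply to every alternative.
print(found(r"^(ab|cd)$", "abxyz"), found(r"^(ab|cd)$", "cd"))
# Anchoring each top-level alternative gives the same exact match.
print(found(r"^ab$|^cd$", "abxyz"), found(r"^ab$|^cd$", "ab"))
```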

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (this will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and add a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.
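If you want to experiment with the scope of local modifiers outside Moodle, note that Python is not a drop-in replacement here: recent Python versions reject "(?i)" in the middle of a pattern. The sketch below therefore rewrites the two examples with Python's scoped form "(?i:...)", which is an assumption-laden translation of the PCRE behaviour described above rather than the same syntax.

```python
import re


def matches(regex, text):
    """True if the regex matches the whole text (case-sensitive by default)."""
    return re.fullmatch(regex, text) is not None


# PCRE "abc(de(?i)gh)xyz": only "gh" is case-insensitive.
INNER = r"abc(de(?i:gh))xyz"
# PCRE "abc(de)(?i)ghxyz": everything after "(de)" is case-insensitive.
TOP_LEVEL = r"abc(de)(?i:ghxyz)"

print(matches(INNER, "abcdeGHxyz"), matches(INNER, "abcdeghXYZ"))
print(matches(TOP_LEVEL, "abcdeGHXYZ"), matches(TOP_LEVEL, "ABCdeghxyz"))
```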

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose the notation in which regexes for answers are written. The default is '''Regular expression''', which means the Perl-compatible regex dialect. The exciting part of notations is that you can use the Preg question type simply as an improved shortanswer, with access to hinting and no need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and copy the answers from your shortanswer questions. The '*' wildcard is supported. Choosing the NFA or DFA engine gives you access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting]] section to understand the settings that configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds where the response starts to match and the character at which the match breaks. Say you entered the expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not begin at the start of the student's response. With partial matching alone, the student is shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button; pressing it shows a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.
The Preg question type doesn't add hinted characters to the student's response (unlike the regex question type); it shows them separately, for two reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the matching engine supports it), hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

Not all regular expressions need to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions are used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in those circumstances - the placeholder syntax would become complex and prone to errors.

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches the string "{2,4}"). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

====Looking for missing things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that will work when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).
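The missing-word pattern above can be tried outside Moodle. The following is a minimal sketch that uses Python's <code>re</code> module as a stand-in for a PCRE-compatible engine; the helper name <code>word_is_missing</code> is an assumption made for illustration and is not part of the Preg question type.

```python
import re

# Anchored negative lookahead: matches (with an empty match at position 0)
# only when the word "necessary" appears nowhere in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """Hypothetical helper: True when the whole word is absent from the response."""
    return MISSING_NECESSARY.search(response) is not None

absent = word_is_missing("it is needed to do that")
present = word_is_missing("it is necessary to do that")
# "unnecessary" does not contain "necessary" as a separate word,
# because \b requires a word boundary on both sides.
inside_other_word = word_is_missing("an unnecessary step")
```

Here <code>absent</code> and <code>inside_other_word</code> are true while <code>present</code> is false, which is the behaviour an answer with "you forgot the word" feedback needs.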

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it requires some complex work.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine: the choice depends on the features you want to use and the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PCRE operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions.

===The ways to give back===
I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, I may be able to spend more time and effort on it, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question type, which I could then refer to, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2012-08-06T18:34:22Z

Oasychev: /* Duplicate subpattern numbers and names */

The Preg question type is a question type that uses regular expressions (regexes) to check student's responses. Regular expressions give vast capabilities and flexibility to both teachers when making questions and students when writing answers to them. This documentation contains a part about expressions in general, a part about the regular expressions as a particular case of expressions and finally a part about the Preg question type itself. If you are familiar with the regex syntax you may skip first parts and go to the [[#Usage of the Preg question type|usage of the Preg question type]] section. More details about the regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm.

Authors:
# Idea, design, question type and behaviours code, regex parsing and error reporting - Oleg Sychev.
# Regex parsing, DFA regex matching engine - Dmitriy Kolesov.
# Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore, subpatterns, backreferences and unicode support - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more work to be done than we have time. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Understanding expressions===
The regular expressions - as any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''': '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression deeper we can find that there is a definite '''order of evaluation''', governed by operator's '''precedence'''. The '*' has a precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: '''(x+y)*2''' will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their '''arity''' - this is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most of arithmetic operators are binary, but the minus has the '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each (arity), minding their evaluation order through '''precedence''' and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
Regular expressions are a powerful mechanism for searching in strings using patterns. So their '''operands''' are characters or character sets. '''A''' is a regular expression that matches a single character 'A'. The ways to define character sets are described below. The special characters that define operators should be '''escaped''' when used as operands - preceded by a backslash. These special characters are:

'''\ ^ $ . [ ] | ( ) ? * + { }'''

Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use ''any'' character as an operand.

====Operands====
Here's an incomplete list of operands that define character sets.
# '''Simple characters''' (with no special meaning) match themselves.
# '''Escaped special characters''' match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\". '''NOTE!''' when you are ''unsure'' whether to escape some character, it is safe to place "\" before any character except letters and digits. ''Do not'' escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
# '''Character sets''' match any character defined in them. Character sets are defined by brackets. The particular ways to define a character set are:
#* "[ab,!]" matches "a", "b", "," and "!";
#* "[a-szC-F0-9]" contains ranges (defined by a ''hyphen between 2 characters'') "a-s", "C-F" and "0-9" mixed with the single character "z", it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
#* "[^a-z-]" starts with the "^" that means a '''negative character set''': it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters so defines itself);
#* "[\-\]\\]" contains ''escaping inside a character set'': it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
# '''Dot meta-character''' (".") matches ''any'' possible character (except newline, but students can't enter it anywhere), escape it "\." if you need to match a single dot.
# '''Escape sequences''' for common character sets:
#* "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
#* "\s" for any space character and "\S" for any non-space character;
#* "\d" for any digit and "\D" for any non-digit.
# '''Simple assertions''' - they are not characters, but conditions to test, they ''don't consume'' characters while matching, unlike other operands:
#* "^" matches in the start of the string, fails otherwise;
#* "$" matches in the end of the string, fails otherwise;
#* "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or in the start (end) of the string if it starts (ends) with a word character;
#* "\B" matches not on a word boundary, negative to "\b".
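The operands listed above can be checked quickly outside Moodle. This is a minimal sketch that uses Python's <code>re</code> module as a stand-in for a Perl-compatible engine; the helper name <code>matches_fully</code> is an assumption made for illustration.

```python
import re

def matches_fully(pattern, text):
    """Hypothetical helper: True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# Each entry: (pattern, a string it must match, a string it must reject).
cases = [
    (r"[ab,!]", "!", "c"),          # plain character set
    (r"[a-szC-F0-9]", "z", "t"),    # ranges mixed with the single character "z"
    (r"[^a-z-]", "A", "-"),         # negative set; the last hyphen stands for itself
    (r"[\-\]\\]", "]", "a"),        # escaping inside a character set
    (r"a\*b\[", "a*b[", "aab["),    # escaped special characters
    (r"\d", "7", "x"),              # escape sequence for a digit
]
results = [(matches_fully(p, good), matches_fully(p, bad)) for p, good, bad in cases]

# Simple assertions consume no characters: \b only tests for a word boundary.
whole_word = re.search(r"\bcat\b", "a cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None
```

Every pair in <code>results</code> is <code>(True, False)</code>, and the word boundary test finds "cat" as a separate word but not inside "concatenate".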
Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define an expression which matches strings of several characters.

====Operators====
Here's a list of the common regex operators:
# '''Concatenation''' - a ''binary'' operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
#* "ab" matches "ab";
#* "a[0-9]" matches "a" followed by any digit, for example, "a5";
# '''Alternative''' - a ''binary'' operator that lets you define a set of alternatives:
#* "a|b" matches "a" or "b";
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "ab|cd|" matches "ab" or "cd" or ''emptiness'' (useful as a part in more complex expressions);
#* "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
# '''Quantifiers''' - a ''unary'' operator that lets you define repetition of something used as its operand:
#* "x*" matches "x" zero or more times;
#* "x+" matches "x" one or more times;
#* "x?" matches "x" zero or one times;
#* "x{2,4}" matches "x" from 2 to 4 times;
#* "x{2,}" matches "x" two or more times;
#* "x{,2}" matches "x" from 0 to 2 times;
#* "x{2}" matches "x" exactly 2 times;
#* "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
# The '''\Q...\E''' sequence (recognized both inside and outside character sets) is used for quoting substrings. Characters in between are treated as literals:
#* "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
#* "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.

====Precedence and order of evaluation====
A '''quantifier''' has precedence '''over concatenation''' and '''concatenation''' has precedence '''over alternative'''. Let's look at what it means:
# ''quantifiers over concatenation'' means that quantifiers are executed first and will repeat only a single character if used without parentheses:
#* "ab*" matches "a" followed by zero or more "b";
#* "(ab)*" matches "ab" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
# ''concatenation over alternative'' means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
#* "ab|cd|de" matches "ab" or "cd" or "de";
#* "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical use of an empty alternative;
# ''quantifier over alternative'' means that you should use parentheses to repeat an alternative set:
#* "ab|cd*" matches "ab" or "c" followed by zero or more "d" like "cdddddd";
#* "(ab|cd)*" matches "ab" or "cd", repeated zero or more times in any order, like "ababcdabcdcd". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
#* "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", not just "aa" or "bb";
#* "a{2}|b{2}" matches "aa" or "bb" only.
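The precedence rules above are easy to verify by experiment. The following minimal sketch uses Python's <code>re</code> module, which follows the same precedence as Perl-compatible engines for these operators; the helper name <code>full</code> is an assumption made for illustration.

```python
import re

def full(pattern, text):
    """Hypothetical helper: True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    # quantifier over concatenation: "*" repeats only "b"
    "ab* on abbb": full(r"ab*", "abbb"),
    "ab* on abab": full(r"ab*", "abab"),
    # parentheses make the quantifier repeat the whole string "ab"
    "(ab)* on abab": full(r"(ab)*", "abab"),
    # quantifier over alternative: "*" belongs to "d" only
    "ab|cd* on cddd": full(r"ab|cd*", "cddd"),
    "ab|cd* on abcd": full(r"ab|cd*", "abcd"),
    "(ab|cd)* on ababcdabcdcd": full(r"(ab|cd)*", "ababcdabcdcd"),
    # a repeated alternative may pick a different branch on each repeat
    "(a|b){2} on ab": full(r"(a|b){2}", "ab"),
    "a{2}|b{2} on ab": full(r"a{2}|b{2}", "ab"),
    "a{2}|b{2} on bb": full(r"a{2}|b{2}", "bb"),
}
```

Only "ab* on abab", "ab|cd* on abcd" and "a{2}|b{2} on ab" come out false, in line with the list above.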

====Subpatterns and backreferences====
'''Subpatterns''' are '''operators''' that 'remember' substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered from 0 for the whole regex and counted by opening parentheses. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e" - there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd.
Subpatterns are usually used with '''backreferences''' which, too, have numbers. Backreferences are '''operands''' that match the same strings which are matched by the subpatterns with the same numbers. The simplest syntax for backreferences is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches strings "aa" and "bb", but neither "ab" nor "ba" because the backreference should match the same character as the subpattern did.
Consider a little example: declaration and initialization of an integer variable in the C programming language:
#* "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", variable name etc, so a more correct regex will look like:
#* "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2 ; var2=123 ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
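The declaration regex above can be tested with any engine that supports numbered backreferences. This is a minimal sketch in Python's <code>re</code> module; the names <code>DECLARATION</code> and <code>is_declaration</code> are assumptions made for illustration.

```python
import re

# The regex from the text: "\1" must repeat exactly the variable name
# captured by the first subpattern.
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def is_declaration(text):
    """Hypothetical helper: True when the whole text is 'int NAME; NAME = NUMBER;'."""
    return DECLARATION.fullmatch(text) is not None

same_name = is_declaration(" int var2 ; var2=123 ; ")
compact = is_declaration("int _var; _var = -10;")
different_name = is_declaration("int a; b = 1;")

# The captured variable name is available as subpattern 1.
captured = DECLARATION.fullmatch("int _var; _var = -10;").group(1)
```

The first two responses are accepted, the third is rejected because "b" is not the name that was declared, and <code>captured</code> holds "_var".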

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:
#* "(?<name1>...)" means a subpattern with name "name1";
#* "(?'name2'...)" means a subpattern with name "name2";
#* "(?P<name3>...)" means a subpattern with name "name3";
#* "\k<name4>" means a backreference to the subpattern named "name4";
#* "\k'name5'" means a backreference to the subpattern named "name5";
#* "\g{name6}" means a backreference to the subpattern named "name6";
#* "\k{name7}" means a backreference to the subpattern named "name7";
#* "(?P=name8)" means a backreference to the subpattern named "name8".
This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - names stay the same.
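As a small illustration of named subpatterns, here is a sketch in Python's <code>re</code> module. Note the assumption: of the syntaxes listed above, Python accepts only "(?P<name>...)" for the subpattern and "(?P=name)" for the backreference, so those two are used here; the Preg engines are described as accepting the other forms as well.

```python
import re

# Named version of "([ab])\1": the same letter must appear twice.
DOUBLED = re.compile(r"(?P<letter>[ab])(?P=letter)")

def is_doubled(text):
    """Hypothetical helper: True for "aa" or "bb" only."""
    return DOUBLED.fullmatch(text) is not None

doubled = is_doubled("bb")
mixed = is_doubled("ab")

# The captured value can be read by name; the number still works too.
match = DOUBLED.fullmatch("aa")
by_name = match.group("letter")
by_number = match.group(1)
```

Adding another subpattern in front of <code>letter</code> would change its number, but <code>match.group("letter")</code> and <code>(?P=letter)</code> would keep working.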

=====Duplicate subpattern numbers and names=====
There is a useful syntax when combining subpatterns with alternation. If you create a group "(?|...)" then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" are 1st ones, "b" and "d" are 2nd ones.

====Assertions====
Assertions about some part of the string don't actually go into the matched text, but affect whether a match occurs:
* '''positive lookahead assertion''' "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
* '''negative lookahead assertion''' "a+(?!b)" matches one or more "a" not followed by "b";
* '''positive lookbehind assertion''' "(?<=b)a+" matches one or more "a" preceded by "b";
* '''negative lookbehind assertion''' "(?<!b)a+" matches one or more "a" not preceded by "b".
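The four assertions can be compared side by side. This is a minimal sketch in Python's <code>re</code> module, which supports the same lookaround syntax; the helper name <code>first_match</code> is an assumption made for illustration.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: text of the leftmost match, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Positive lookahead: "b" must follow, but it is not part of the match.
ahead = first_match(r"a+(?=b)", "aaab")
# Negative lookahead: the engine gives back one "a" so that the match
# is followed by "a" instead of "b".
not_ahead = first_match(r"a+(?!b)", "aaab")
# Positive lookbehind: only the "a" run right after "b" qualifies.
behind = first_match(r"(?<=b)a+", "aabaa")
# Negative lookbehind: the first "a" follows "b", so the match starts
# at the second "a".
not_behind = first_match(r"(?<!b)a+", "baac")
```

The results are "aaa", "aa", "aa" and "a". The two negative cases show that a negative assertion can still succeed on a shorter or later match, which is worth keeping in mind when writing feedback rules.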

====Matching====
Matching means finding a part of the student's answer that suits the regular expression. This part is called the '''match'''. You should enter regular expressions as '''answers''' to questions without modifiers or enclosing characters (modifiers will be added for you by the question - "u" is added always and "i" is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regex) to be shown to the student as the '''correct answer'''. The question will use all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is the shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one will be the shortest to complete.

====Anchoring====
Anchoring is used to set restrictions on the matching process by using simple assertions:
* if a regular expression starts with the '''^''' the match should start at the start of the student's response;
* if a regular expression ends with the '''$''' the match should end at the end of the student's response;
* otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the rest of the regex, and concatenation has precedence over alternative, which makes their usage slightly tricky:
* "^ab|cd$" will match "ab" at the start of the string or "cd" at the end of it;
* "^(ab|cd)$" uses parentheses to match exactly "ab" or "cd";
* "^ab$|^cd$" is another way to get exact match (all top-level alternatives are anchored).
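The difference between these anchored forms shows up as soon as the response contains extra text. A minimal sketch in Python's <code>re</code> module follows; the helper name <code>found</code> is an assumption made for illustration, and <code>re.search</code> plays the role of unanchored matching.

```python
import re

def found(pattern, response):
    """Hypothetical helper: True when the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

# "^" binds only to "ab" and "$" only to "cd".
loose_start = found(r"^ab|cd$", "ab and more")
loose_end = found(r"some text cd$|^ab", "some text cd")

# Parentheses make both anchors apply to every alternative.
exact_rejects = found(r"^(ab|cd)$", "ab and more")
exact_accepts = found(r"^(ab|cd)$", "cd")

# Anchoring every top-level alternative gives the same exact behaviour.
each_anchored = found(r"^ab$|^cd$", "ab") and not found(r"^ab$|^cd$", "abcd")

# Without anchors the match may sit anywhere inside the response.
unanchored = found(r"ab|cd", "xx cd xx")
```

All values are true except <code>exact_rejects</code>, which shows that the parenthesised form refuses a response with trailing text.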

If you set the '''exact matching''' option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:
* "(?i)" will turn case-sensitivity off;
* "(?-i)" will turn case-sensitivity on.
This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses) they work up to the end of the expression, i.e. with case sensitivity on for the question:
* "abc(de(?i)'''gh''')xyz" will have the bold part case-insensitive;
* "abc(de)(?i)'''ghxyz'''" will have the bold part case-insensitive.

===Usage of the Preg question type===
Basically, this question type is an extended version of the Shortanswer.

====Notations====
Starting from Preg 2.1, the "notations" feature allows you to choose a notation in which regexes for answers will be written. The '''Regular expression''' notation, which means the Perl-compatible regex dialect, is the default one. The exciting part of notations is that you can use the Preg question type just as an improved shortanswer, having access to hinting without any need to understand regular expressions! Just choose the '''Moodle shortanswer''' notation and you can copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you can get access to hinting. You can skip everything said about regular expressions there, but be sure to read the [[#Hinting|hinting]] section to understand the various settings you can alter to configure your question's hinting behaviour.

====Hinting====
Some matching engines support hinting (not an easy thing to do using PHP at all) in the adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching finds that the response starts to match and then breaks on some character. Say you entered an expression:
'''are blue, white(,| and) red'''
and a student answered:
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button. Pressing it shows the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty applies to an attempt without hinting, while the hint penalty applies to an attempt with hinting.
The Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type), showing them separately instead, for two reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not a desirable behaviour.
When possible (if the matching engine supports it), hinting chooses the character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete the match, while ' ' leads to the path 3 characters longer.

Not all regular expressions have to give a 100% grade. Suppose you added an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.
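The selection rule can be written down in a few lines. This is only a sketch of the rule as described above, not the question type's actual code; the function name <code>hinting_candidates</code> and the representation of answers as (regex, grade) pairs with grades as fractions are assumptions made for illustration.

```python
def hinting_candidates(answers, hint_grade_border):
    """Hypothetical helper: regexes allowed to take part in partial matching.

    answers is a list of (regex, grade) pairs, grade being a fraction from 0 to 1.
    """
    return [regex for regex, grade in answers if grade >= hint_grade_border]

answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

# Border 1: only the 100% answer may be used for hints.
strict = hinting_candidates(answers, 1.0)
# Border 0.5: the 60% answer may also lead the student to its response.
relaxed = hinting_candidates(answers, 0.5)
```

With the border at 1, the response "are white, oh I forgot other colors." would not be hinted towards "are white, red", because the 60% expression is excluded from partial matching.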

====Subpattern capturing and feedback====
Any pair of parentheses in a regex is considered a '''subpattern''', and when matching the engine remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. Precisely, '''0''' is the whole regex, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

'''PHP preg engine''' and '''NFA''' support full subpattern capturing. '''DFA''' engine can't do it by its nature, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
If you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then a student entered
123.34
He will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in those circumstances - the placeholder syntax would become complex and prone to errors.
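The placeholder substitution described above can be imitated as follows. This is a minimal sketch in Python, not the question type's implementation; the function name <code>fill_feedback</code> is an assumption made for illustration, and it assumes every placeholder number refers to an existing subpattern.

```python
import re

def fill_feedback(regex, response, feedback):
    """Hypothetical helper: replace {$N} placeholders with captured subpatterns.

    Returns None when the regex does not match the response.
    """
    match = re.search(regex, response)
    if match is None:
        return None

    def substitute(placeholder):
        value = match.group(int(placeholder.group(1)))
        # A subpattern that did not take part in the match becomes "".
        return value if value is not None else ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

with_integral = fill_feedback(NUMBER, "123.34", FEEDBACK)
without_integral = fill_feedback(NUMBER, ".5", FEEDBACK)
```

For "123.34" the result is the sentence shown above. For ".5" the text "Integral part is" stays in place with nothing after it, which is the limitation just mentioned.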

====Error reporting====
Native PHP preg extension functions only report if there is an error in regular expression or not, so '''PHP preg extension''' engine can't tell you much about the error.

'''NFA''' and '''DFA''' engines use a custom '''regular expression parser''', so they support advanced error reporting. There are several classes of potential errors:
* more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
* unopened closing parenthesis "abc)";
* unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
* empty parentheses of any sort "(?=)";
* quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
* unclosed brackets of character classes "[a-fA-F\d";
* setting and unsetting the same modifier at the same time "(?i-i)";
* unknown unicode properties "\p{Squirrel}";
* unknown posix classes "[[:hamster:]]";
* unknown (*...) sequence "(*QWERTY)";
* incorrect ranges for quantifiers "a{5,4}" or character sets "[z-a]".
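To get a feeling for what such reports look like, here is a small sketch using Python's <code>re</code> module, whose parser also rejects several of the patterns listed above. It is not the Preg parser, and the messages it produces are Python's own; the names <code>MALFORMED</code> and <code>describe_error</code> are made up for this example.

```python
import re

MALFORMED = [
    r"abc)",        # unopened closing parenthesis
    r"(?:qwerty",   # unclosed opening parenthesis
    r"+",           # quantifier without an operand
    r"a(+)",        # nothing to repeat inside a subpattern
    r"[a-fA-F\d",   # unclosed brackets of a character class
    r"a{5,4}",      # quantifier range with minimum above maximum
    r"[z-a]",       # reversed character range
]

def describe_error(pattern):
    """Return the parser's error message, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as error:
        return str(error)
    return None

reports = {pattern: describe_error(pattern) for pattern in MALFORMED}
```

Each entry of <code>reports</code> holds a message naming the problem, which is the kind of information a teacher needs to fix the expression.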

PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches with the string "{2,4}"). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE. If you stand for or against this decision, please write your position and reasons in the comments. It may be best to have two modes, but this literally means two parsers and is outside the current scope of development. There are more pressing issues ahead.

====Looking for missing things====
Joseph Rezeau's REGEXP question type has a '''missing words''' feature that allows you to define an answer that will work when something is absent in the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
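The effect of the anchor can be checked with a short Python sketch. It uses Python's <code>re</code> module as a stand-in for the matching engine; the helper name <code>word_is_missing</code> is hypothetical.

```python
import re

# The regular expression from the text above.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True when the word 'necessary' does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None

# Without the ^ anchor the assertion succeeds somewhere after the start of
# the word, so the expression matches even though the word is present.
unanchored_hit = re.search(r"(?!.*\bnecessary\b.*)", "it is necessary") is not None
```

Here <code>word_is_missing("this step is optional")</code> is true and <code>word_is_missing("this step is necessary here")</code> is false, while the unanchored variant matches in both cases.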

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it requires some complex work.

====Matching engines====
A matching engine is the program code that performs matching. There is no 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. Engines have different degrees of stability and offer different features.
=====PHP preg extension=====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and it is very stable and thoroughly tested. But PHP functions don't support partial matching, so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support.

=====Deterministic finite state automata (DFA)=====
This is custom PHP code that uses the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support can still differ from the standard. On the bright side, it supports '''hinting'''.

Currently supported operands:
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)
* unicode properties

Currently supported operators:
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

=====Non-deterministic finite state automata (NFA)=====
The NFA engine was introduced in the 2.1 release. It is a custom matcher that can do everything that the DFA matcher can, but also supports:
* subpattern capturing (including named subpatterns, duplicate subpatterns numbers)
* backreference capturing (including named backreferences)

So, you don't have to choose between hinting and subpattern capturing in your questions - NFA can do them both! Also, the NFA matcher is more stable than the DFA one and it is probably the best choice if you want to use partial matching with hinting, but without lookaround assertions.

===The ways to give back===
I am a high school teacher, researcher and programmer who must do much at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your usage of the Preg question, which I could then reference, would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
* if you would take some more work and organise publishing a paper (or at least thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write me what your organisation is and how you use preg - that'll help and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual test or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improve simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop the backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2011-12-23T14:35:56Z

Oasychev:

Preg question type is a question type using regular expression pattern matching to find out whether a student's response is correct. It uses the Perl-compatible regular expressions dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, behaviours code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
# NFA regular expression matching engine, backreferences, cross-testing, backup&restore - Valeriy Streltsov
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Notations===
Starting from Preg 2.1, the notation feature allows you to choose the notation in which regular expressions for answers will be written. '''Regular expression''' is the default one.

One exciting part of it is that you can use the Preg question type just as an improved shortanswer, having access to the hinting facility without any need to understand regular expressions at all! Just choose the '''Moodle shortanswer''' notation and you can simply copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip all that is said on the regular expression topic here, but be sure to read the [[#Hinting|hinting section]] below to understand the various settings you can alter to configure your question's hinting behaviour.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions from childhood and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Don't find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''' there, governed by the operators' '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators: their '''arity''', which is just the number of operands required. In example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus has '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of their order of evaluation using their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of regular expressions is pattern matching in strings. So their '''operands''' are characters or character sets. '''A''' is a regular expression too and it matches with the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+, - etc), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches with only one character isn't very useful. So there come '''operators''' that allow us to define an expression matching with a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match with themselves
# ''escaped special characters'': if you need to use a character with special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches with a* (while '''a*''' matches with a repeated zero or more times). The backslash is a special character too and should be escaped: '''\\''' matches with \
#* when '''unsure''' whether to escape some character, it is safe to place \ before any character except letters, digits and underscore, so you don't need to worry whether a particular symbol is special or not
#* do not escape letters unless you know what you are doing, since they get special meaning when escaped and lose it when not
# ''character classes'' you could specify a number of possible characters in one place in square brackets:
#* '''[ab,!]''' matches with a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' you could specify ranges for letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' starts with ^ '''[^ab]''' means any characters except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches with - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# ''dot meta-character'' '''.''' matches with any possible character (except newline, but a student couldn't enter it anyway); you should escape the dot '''\.''' if you need to match a single dot.
# ''escape sequences'' for common character classes:
#* '''\w''' for any word character (letter, underscore or digit) and '''\W''' for non-word character
#* '''\s''' for any space character and '''\S''' for any non-space character
#* '''\d''' for any digit and '''\D''' for any non-digit
# ''simple assertions'' - these are not characters, but conditions to test, they don't "eat" characters while matching, unlike other operands:
#* '''^''' - match in the start of response string, fails otherwise
#* '''$''' - match in the end of response string, fails otherwise
#* '''\b''' - match on the word boundary, i.e. between word (\w) and non-word (\W) characters, and in the start/end of the response if it starts (ends) with word character
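The operands listed above can be tried out with a few lines of Python. This is a rough sketch using Python's <code>re</code> module, which handles these basic operands the same way; the helper <code>matches_fully</code> and the test table are invented for the example.

```python
import re

def matches_fully(pattern, text):
    """True when the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
operand_checks = [
    (r"a\*", "a*", True),             # escaped special character
    (r"a*", "aaa", True),             # unescaped * is a quantifier
    (r"\\", "\\", True),              # escaped backslash matches one backslash
    (r"[ab,!]", ",", True),           # character class
    (r"[a-szC-F0-9]", "D", True),     # ranges mixed with single characters
    (r"[a-szC-F0-9]", "t", False),    # t is outside a-s and is not z
    (r"[^ab]", "c", True),            # negative character class
    (r"[\-\]\\]", "]", True),         # escaping inside a character class
    (r".", "x", True),                # dot matches any character
    (r"\.", "x", False),              # escaped dot matches only a dot
    (r"\w\W\d\s", "a-1 ", True),      # escape sequences for common classes
]
results = [matches_fully(p, t) == expected for p, t, expected in operand_checks]

# \b does not consume characters, it only tests the position.
whole_word = re.search(r"\bcat\b", "a cat sat") is not None
inside_word = re.search(r"\bcat\b", "concatenate") is not None
```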

====Operators====
Most common regular expression operators used (could anyone help expand descriptions and examples please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches with ab
#* '''a[0-9]''' matches with a followed by any digit
# ''alternative'' - '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. it is a repeated alternative, not selecting one alternative and repeating it
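A few of the alternatives and quantifiers above, checked with Python's <code>re</code> module as a stand-in engine. The '''x{,2}''' form is left out on purpose, because regex libraries do not all agree on it. The helper name <code>full</code> is hypothetical.

```python
import re

def full(pattern, text):
    """True when the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
operator_checks = [
    (r"ab|cd|de", "cd", True),      # one of three alternatives
    (r"(aa|bb)c", "bbc", True),     # alternative set followed by c
    (r"(aa|bb)c", "c", False),      # no empty alternative here
    (r"(aa|bb|)c", "c", True),      # the empty alternative allows a bare c
    (r"x{2,4}", "xxx", True),       # between 2 and 4 repetitions
    (r"x{2,4}", "x", False),        # too few repetitions
    (r"x{2,}", "xxxxx", True),      # two or more repetitions
    (r"x{2}", "xxx", False),        # exactly two repetitions required
    (r"x?", "", True),              # zero or one repetition
]
operator_results = [full(p, t) == expected for p, t, expected in operator_checks]
```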

====Precedence and order of evaluation====
'''Quantifier''' has precedence over '''concatenation''' and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are executed first and without brackets would repeat only a single character:
#* '''ab*''' matches with a followed by zero or more b's
#* changing this using brackets allows us to define a string repetition: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you could define multi-character alternatives without brackets (for single character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc
#* note that a quantifier repeats the alternative, not the definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only
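The difference that brackets make is easiest to see side by side. The sketch below uses Python's <code>re</code> module, which follows the same precedence rules for these operators; the helper name <code>accepts</code> is made up.

```python
import re

def accepts(pattern, text):
    """True when the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
precedence_checks = [
    (r"ab*", "abbb", True),             # * repeats only b
    (r"ab*", "abab", False),
    (r"(ab)*", "abab", True),           # brackets make * repeat the string ab
    (r"(ab)*", "abbb", False),
    (r"ab|cd*", "cddd", True),          # * applies only to d
    (r"ab|cd*", "abd", False),
    (r"(ab|cd)*", "ababcdabcdcd", True),
    (r"(a|b){2}", "ab", True),          # the alternative itself is repeated
    (r"a{2}|b{2}", "ab", False),        # only aa or bb
    (r"a{2}|b{2}", "bb", True),
]
precedence_results = [accepts(p, t) == expected for p, t, expected in precedence_checks]
```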

====Assertions====
''Assertions'' are conditions on some part of the string that doesn't actually go into the matched text, but affects whether matching occurs or not.
* ''positive lookahead assert'' '''a+(?=b)''' matches with one or more a followed by b, without including b in the match
* ''negative lookahead assert'' '''a+(?!b)''' matches with one or more a not followed by b
* ''positive lookbehind assert'' '''(?<=b)a+''' matches with one or more a preceded by b
* ''negative lookbehind assert'' '''(?<!b)a+''' matches with one or more a not preceded by b
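The following sketch shows which text actually ends up in the match for each of the four assertions. It runs on Python's <code>re</code> module, not on the Preg engines; the helper name <code>found</code> is hypothetical.

```python
import re

def found(pattern, text):
    """Return the matched text, or None when there is no match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

lookahead = found(r"a+(?=b)", "caaab")          # b is required but not included
negative_lookahead = found(r"a+(?!b)", "aaab")  # the last a is given up, it is followed by b
lookbehind = found(r"(?<=b)a+", "baaa")         # b is required before the match
no_lookbehind = found(r"(?<=b)a+", "caaa")      # no a here is preceded by b
negative_lookbehind = found(r"(?<!b)a+", "baaa")  # the first a is skipped, it follows b
```

Note how the negative assertions shorten the match instead of failing outright: the engine looks for any position where the condition holds.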

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that suits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (modifiers will be added for you by the question - '''u''' is always added and '''i''' is added in case-insensitive mode). You should also enter one correct response (that matches at least one 100% grade regular expression) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and give a grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete will be chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine couldn't tell which one would be shortest to complete.
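The grading step can be pictured with a minimal Python sketch. It assumes that answers are tried in the order they are listed and that no match means a zero grade; the function <code>grade_response</code>, its parameters and the <code>ANSWERS</code> list are hypothetical, and the two expressions are borrowed from the hinting section of this page. Partial matching is not modelled here.

```python
import re

def grade_response(answers, response, case_sensitive=True):
    """Return the grade of the first answer whose regex matches the response."""
    # Assumption: the 'i' modifier is what case-insensitive mode adds.
    flags = 0 if case_sensitive else re.IGNORECASE
    for pattern, grade in answers:
        if re.search(pattern, response, flags):
            return grade
    return 0.0  # no full match for any answer

ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

full_grade = grade_response(ANSWERS, "they are blue, white and red")
reduced_grade = grade_response(ANSWERS, "are white, red")
```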

====Anchoring====
Anchoring sets restrictions on the matching process using simple assertions:
* if a regular expression starts with '''^''' the match should start at the start of the student's response;
* if a regular expression ends with '''$''' the match should end at the end of the student's response;
* otherwise the regular expression match can be contained anywhere inside the student's response.

Note that simple assertions are concatenated with the regex, and concatenation has precedence over alternative, which makes their use slightly tricky.
#* '''^ab|cd$''' would match ''ab'' from the start of string or ''cd'' at the end of it
#* '''^(ab|cd)$''' use brackets to match exactly with ''ab'' or ''cd''
#* '''^ab$|^cd$''' is another way to get exact matching of regex with top-level alternatives
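The three anchoring variants above behave as follows in a quick Python check (Python's <code>re</code> module is used as a stand-in engine; the helper name <code>hits</code> is invented).

```python
import re

def hits(pattern, text):
    """True when the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected result)
anchoring_checks = [
    (r"^ab|cd$", "abXXX", True),     # ^ belongs to the first alternative only
    (r"^ab|cd$", "XXXcd", True),     # $ belongs to the last alternative only
    (r"^(ab|cd)$", "abXXX", False),  # brackets make both anchors apply to both
    (r"^(ab|cd)$", "cd", True),
    (r"^ab$|^cd$", "ab", True),      # every alternative anchored by hand
    (r"^ab$|^cd$", "abcd", False),
]
anchoring_results = [hits(p, t) == expected for p, t, expected in anchoring_checks]
```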

If you set the '''exact matching''' option to yes (default setting), the question will add ^ and $ to each regular expression for you (with correct brackets so as not to interfere with your subpatterns). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback while using manually anchored expressions for grading.

====Local case-sensitivity modifiers====
Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expression using the standard syntax of Perl-compatible regular expressions:
#* '''(?i)''' would turn case-sensitivity off
#* '''(?-i)''' would turn case-sensitivity on

This overrides the general case-sensitivity, which is chosen at the question level. So you can make some answers case-sensitive and others not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer regular expression starting with ''(?i)'' to lessen the grade when the case doesn't match but the answer is otherwise correct.

When used inside round brackets, these local modifiers work up to the closest ''')''' (closing round bracket). When used on the main level (not inside brackets) they work up to the end of the expression. I.e. with case sensitivity on for the question:
* abc(de(?i)'''gh''')xyz would have bold part case-insensitive
* abc(de)(?i)'''ghxyz''' would have bold part case-insensitive
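The scope of a local modifier can be imitated in Python, with one caveat: recent versions of Python's <code>re</code> module do not accept a bare '''(?i)''' in the middle of a pattern, so this sketch swaps in the scoped form '''(?i:...)''', which marks the same region explicitly. It illustrates the scoping only and is not the Preg syntax; the names below are hypothetical.

```python
import re

def exact(pattern, text):
    """True when the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# Equivalent of abc(de(?i)gh)xyz: only gh is case-insensitive.
INSIDE_BRACKETS = r"abc(de(?i:gh))xyz"
# Equivalent of abc(de)(?i)ghxyz: everything after the modifier is case-insensitive.
MAIN_LEVEL = r"abc(de)(?i:ghxyz)"

modifier_checks = [
    (INSIDE_BRACKETS, "abcdeGHxyz", True),
    (INSIDE_BRACKETS, "abcDEghxyz", False),  # de is still case-sensitive
    (INSIDE_BRACKETS, "abcdeghXYZ", False),  # the modifier ended at the bracket
    (MAIN_LEVEL, "abcdeGHXYZ", True),
    (MAIN_LEVEL, "ABCdeghxyz", False),       # the part before the modifier is unaffected
]
modifier_results = [exact(p, t) == expected for p, t, expected in modifier_checks]
```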

===Hinting===
Some matching engines support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts to match and breaks off at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student will have a '''hint''' button; by pressing it he receives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the REGEXP question does), showing it separately instead for a number of reasons:
# it is the student's responsibility whether he wants to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified it is too easy to repeatedly press '''hint''', which is usually not a desirable behaviour.
When possible (if the matching engine supports it), hinting chooses a character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.
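The choice of the hint character can be reproduced with a brute-force Python sketch. It is not how the DFA or NFA engines work: instead of an automaton it assumes that the regular expression accepts only a short, explicitly listed set of strings (<code>ACCEPTED</code>), and the function name <code>next_hint</code> is hypothetical.

```python
def next_hint(accepted, response):
    """Return (partial match, hint character) for the partial match that needs
    the fewest characters to complete, or None if nothing matches partially."""
    best = None
    for start in range(len(response)):       # the regex is unanchored
        for target in accepted:
            matched = 0
            while (matched < len(target)
                   and start + matched < len(response)
                   and target[matched] == response[start + matched]):
                matched += 1
            if matched == 0 or matched == len(target):
                continue  # nothing matched here, or it is already a full match
            remaining = len(target) - matched
            if best is None or remaining < best[0]:
                best = (remaining, response[start:start + matched], target[matched])
    return None if best is None else best[1:]

# Every string accepted by: are blue, white(,| and) red
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

first_hint = next_hint(ACCEPTED, "they are blue, vhite and red")
second_hint = next_hint(ACCEPTED, "are blue, white; red")
```

For the second response the comma wins because ", red" needs 5 more characters while " and red" needs 8, the 3-character difference mentioned above.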

It is possible that not all regular expressions will give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only 100% grade regular expressions will be used for hinting; if you set it to 0.5, regular expressions with 50%-100% grades will be used and those with 0%-49% will not. Regular expressions not used for hinting work only when they have a full match in the student's response.

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match, but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Asserts don't create subpatterns.

Subpatterns are counted from left to right by opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: first capturing integral part, second - fractional part of the number.
Suppose you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax may become complex and prone to errors.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. There are several classes of potential errors reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, a quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches with {2,4}). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you stand for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers and is outside the current scope of development. There are more pressing issues ahead.

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature that allows you to define an answer that will work when something is absent in the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the matching start. The regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of some point in the string;
* '''^''' is an assertion too, one that anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it is still some time away, requiring further complex work.

===Matching engines===
A matching engine is the program code that performs matching (engines differ either in the methods used or in who wrote them). There is no single 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. Engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, PHP functions don't support partial matching (while PCRE can), so (unless we storm PHP developers to add support for partial matching) there is '''no hinting''' there. However, it will support subpattern capturing. Choose it when you need complex regexp features other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support can still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (there would be more):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
The NFA engine, introduced in the 2.1 release, is a custom matcher that can basically do anything the DFA matcher can, with the addition of subpattern capturing and backreferences.

Now you don't have to choose between hinting and using captured subpatterns in your questions: NFA can do them both!

===The ways to give back===
I am a high school teacher, researcher and programmer who must do much at his main paid job and does not have much free time to spend on developing this question type. If you could help me in some ways, I may be able to spend more time and effort doing this, though. Some examples:
* publishing a thesis or paper describing your use of the Preg question, which I could then reference, would improve the rating of the Preg project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for the software;
* if you would take on some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me what your organisation is and how you use Preg - that'll help a little and I would be able to better determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improved simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop backtracking matching engine
* Develop more help and examples for people who don't know much about regular expressions
* Improve Unicode support of custom matching engines

[[Category:Contributed code]]

Preg question type

2011-12-13T10:33:38Z

Oasychev:

The Preg question type is a question type that uses regular expression pattern matching to determine whether a student's response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, behaviours code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov;
# NFA regular expression matching engine - Valeriy Streltsov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Notations===
Starting from Preg 2.1, the notation feature allows you to choose the notation in which regular expressions for answers will be written. '''Regular expression''' is the default one.

One exciting part of this is that you can use the Preg question type simply as an improved shortanswer, with access to the hinting facility and without any need to understand regular expressions at all! Just choose the '''Moodle shortanswer''' notation and you can copy answers straight from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting section]] below to understand the various settings you can alter to configure your question's hinting behaviour.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by operator '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators is their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minus work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), controlling their order of evaluation using '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use different operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of a regular expression is pattern matching in strings. So its '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - that is, preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches only one character isn't very useful. That is where '''operators''' come in: they allow us to define an expression matching a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match themselves
# ''escaped special characters'' - if you need to use a character with special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches a* (while '''a*''' matches a zero or more times); the backslash is a special character too and should be escaped: '''\\''' matches \
# ''character classes'' - you can specify a number of possible characters for one position in square brackets:
#* '''[ab,!]''' matches a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' - you can specify ranges for letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# the ''dot meta-character'' '''.''' matches any possible character (except newline, but a student couldn't enter one anyway); you should escape the dot, '''\.''', if you need to match a literal dot.
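
The operands above can be tried out in a short sketch. It uses Python's standard re module as a stand-in (an assumption: Preg itself runs on PHP, but these basic operands behave the same way in both dialects), and the helper name matches is made up for this illustration.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Escaped special character: a\* is the letter a followed by a literal star.
print(matches(r"a\*", "a*"))            # True
print(matches(r"a\*", "aa"))            # False
# Character class mixing ranges (a-s, C-F, 0-9) with the single character z.
print(matches(r"[a-szC-F0-9]", "z"))    # True
print(matches(r"[a-szC-F0-9]", "B"))    # False
# Negative character class: anything except a and b.
print(matches(r"[^ab]", "c"))           # True
# Escaping inside a class: the class contains -, ] and \.
print(all(matches(r"[\-\]\\]", ch) for ch in "-]\\"))  # True
# The dot matches any character, the escaped dot only a literal dot.
print(matches(r".", "x"), matches(r"\.", "x"))         # True False
```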

====Operators====
The most commonly used regular expression operators are (could anyone help expand descriptions and examples please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches ab
#* '''a[0-9]''' matches a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. the alternative is repeated, rather than one alternative being selected and then repeated
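
A few of the operators listed above, checked in a minimal sketch. Python's re module is used here only as a convenient stand-in for the PHP engines, and accepted is a hypothetical helper that filters a list of candidate responses by exact (whole-string) matching.

```python
import re

def accepted(pattern, candidates):
    """Hypothetical helper: the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Empty alternative: the bracketed part may be absent.
print(accepted(r"(aa|bb|)c", ["aac", "bbc", "c", "abc"]))
# ['aac', 'bbc', 'c']

# Bounded quantifier: from 2 to 4 repetitions of x.
print(accepted(r"x{2,4}", ["x", "xx", "xxxx", "xxxxx"]))
# ['xx', 'xxxx']

# A quantifier repeats the alternative, not one fixed choice from it.
print(accepted(r"(a|b){2}", ["aa", "ab", "ba", "bb", "a"]))
# ['aa', 'ab', 'ba', 'bb']
```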

====Precedence and order of evaluation====
'''Quantifier''' has precedence over '''concatenation''', and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches a followed by zero or more b's
#* changing this using brackets allows us to define repetition of a string: '''(ab)*''' matches ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single-character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches ab or cd or de
#* '''(aa|bb)c''' matches aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' matches aac or bbc or c - a typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (rather than just the last character in it):
#* '''ab|cd*''' matches ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the alternative, not one definite selection from it, i.e.:
#*# '''(a|b){2}''' matches aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only
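
The precedence rules above can be verified with a small sketch. It again assumes Python's re module as a stand-in for the Preg engines (the precedence of these three operators is the same), and accepted is a hypothetical helper for exact matching.

```python
import re

def accepted(pattern, candidates):
    """Hypothetical helper: the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Quantifier over concatenation: ab* is read as a(b*).
print(accepted(r"ab*", ["a", "abbb", "abab"]))       # ['a', 'abbb']
print(accepted(r"(ab)*", ["", "abab", "abbb"]))      # ['', 'abab']

# Concatenation over alternative, quantifier over both: ab|cd* is (ab)|(c(d*)).
print(accepted(r"ab|cd*", ["ab", "c", "cddd", "abd", "abcd"]))
# ['ab', 'c', 'cddd']

# Repeating a fixed choice needs the quantifier inside each alternative.
print(accepted(r"a{2}|b{2}", ["aa", "ab", "bb"]))   # ['aa', 'bb']
```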

====Assertions====
''Assertions'' are conditions on some part of the string that does not actually become part of the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'': '''a+(?=b)''' matches one or more a's followed by b, without including b in the match
* ''negative lookahead assertion'': '''a+(?!b)''' matches one or more a's not followed by b
* ''positive lookbehind assertion'': '''(?<=b)a+''' matches one or more a's preceded by b
* ''negative lookbehind assertion'': '''(?<!b)a+''' matches one or more a's not preceded by b
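
The four assertions can be observed in a short sketch, assuming Python's re module as a stand-in for a PCRE-style engine (found is a hypothetical helper). It shows that the asserted character never becomes part of the match, and that a backtracking engine may shorten a match to satisfy a negative assertion.

```python
import re

def found(pattern, text):
    """Hypothetical helper: text of the first match anywhere, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(found(r"a+(?=b)", "caab"))    # aa  (b is required but not included)
print(found(r"a+(?!b)", "aab"))     # a   (the second a is followed by b)
print(found(r"(?<=b)a+", "aabaa"))  # aa  (only the a's after b qualify)
print(found(r"(?<!b)a+", "baac"))   # a   (the first a is preceded by b)
```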

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (modifiers will be added for you by the question - '''u''' always and '''i''' in case-insensitive mode). You should also enter one correct response (that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question tries all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and gives the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one would be shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be contained anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
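
The effect of anchoring can be seen in a minimal sketch. It assumes Python's re module as a stand-in, and the anchor helper is hypothetical: it wraps the pattern in a non-capturing group before adding ^ and $ so that a top-level alternative stays intact, which is an assumption about how exact matching could be done rather than a description of the question's actual code.

```python
import re

def anchor(pattern):
    """Hypothetical helper: require the pattern to cover the whole response."""
    return "^(?:" + pattern + ")$"

pattern = r"are blue, white(,| and) red"
response = "they are blue, white and red"

# Unanchored: the match may sit anywhere inside the response.
print(bool(re.search(pattern, response)))            # True
# Anchored at the start only: fails because of the leading "they ".
print(bool(re.search("^" + pattern, response)))      # False
# Anchored at the end only: the match does end where the response ends.
print(bool(re.search(pattern + "$", response)))      # True
# Exact matching: both anchors, so the whole response must match.
print(bool(re.search(anchor(pattern), response)))    # False
```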

===Hinting===
Some matching engines support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts matching and breaks off at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. With partial matching alone, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student will have a '''hint''' button; pressing it gives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.

It is possible that not all regular expressions will give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with 50%-100% grades are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
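
The selection rule for the hint grade border can be written down as a minimal sketch. The data layout (pairs of regular expression and grade as a fraction) and the function name hinting_candidates are assumptions made for this illustration, not the question's actual code.

```python
def hinting_candidates(answers, hint_grade_border):
    """Hypothetical helper: regexes whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

# (regular expression, grade as a fraction of the full mark)
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

# Border 1: only the 100% answer may drive partial matching and hints.
print(hinting_candidates(answers, 1.0))
# Border 0.5: the 60% answer is used for hinting as well.
print(hinting_candidates(answers, 0.5))
```

With the border at 1, the student who forgot ''blue'' is never hinted towards the 60% answer; that expression still applies when it matches in full.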

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match, but also the parts of it corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all the repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), you should use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value, etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax would become complex and prone to errors.
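
The placeholder substitution described above can be sketched as follows. This is an illustration in Python, not the question's PHP code: the function name fill_placeholders is hypothetical, and leaving a placeholder untouched when its subpattern number does not exist is an assumption.

```python
import re

def fill_placeholders(feedback, match):
    """Hypothetical helper: replace {$n} with the text captured by subpattern n."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)   # assumption: unknown number stays as is
        return match.group(index) or ""   # unmatched subpattern gives empty string
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

number = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_placeholders(feedback, number.fullmatch("123.34")))
# The number is: 123.34 Integral part is 123 and fractional part is 34
print(fill_placeholders(feedback, number.fullmatch(".5")))
# The number is: .5 Integral part is  and fractional part is 5
```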

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. Several classes of potential errors are reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you stand for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that works when something is absent from the response (and give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. A regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', which allows matching only if the word '''necessary''' does not occur anywhere ahead of the given point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!.*''' and '''.*)''', as in the example above. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
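
The missing-word expression can be tried out in a small sketch, assuming Python's re module as a stand-in for the PHP preg engine (word_is_missing is a hypothetical helper). The last line shows why the ^ anchor is needed: without it the assertion succeeds at a position after the word has started.

```python
import re

MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """Hypothetical helper: True if the word 'necessary' is absent."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("this step is optional"))        # True
print(word_is_missing("this step is necessary here"))  # False
# \b makes it a whole-word test: 'unnecessary' does not count as 'necessary'.
print(word_is_missing("an unnecessary step"))          # True

# Without ^ the expression matches even though the word is present.
print(re.search(r"(?!.*\bnecessary\b.*)", "it is necessary") is not None)  # True
```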

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it is still some time away, requiring further complex work.

===Matching engines===
Matching engines are different pieces of program code that do the matching (either using different methods or written by different people). There is no single 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. They have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE can), so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features other engines don't support, subpattern capturing, or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones support may still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (there would be more):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
The NFA engine, introduced in the 2.1 release, is a custom matcher that can do basically everything the DFA matcher can, with the addition of subpattern capturing and backreferences.

Now you don't have to choose between hinting and using captured subpatterns in your questions: NFA can do both!

===The ways to give back===
I am a high school teacher, researcher and programmer with a demanding main job and little free time to spend on developing this question type. If you can help me in some way, I may be able to spend more time and effort on it. Some examples:
* publishing a thesis or paper describing your use of the Preg question, which I could then reference, would improve the standing of the Preg project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for the software;
* if you are willing to do some more work and organise publishing a paper (or at least a thesis) with me as co-author, that would '''help even more''' - please inform me immediately if you consider this;
* if publishing is hard, you could just write to me about what your organisation is and how you use Preg - that will help a little, and I will be better able to determine what should be done next;
* join the testing efforts, either by performing manual tests or by writing unit tests (it's easy to do even if you aren't a great programmer, you just need to know regular expressions - contact me and I'll tell you how).

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improved simple assertions support
* Support for complex assertions
* Hinting not just one character, but completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop backtracking matching engine
* Develop more help and examples for people who don't know much about regular expressions
* Improve Unicode support of custom matching engines

[[Category:Contributed code]]

Preg question type

2011-12-13T10:20:32Z

Oasychev:

The Preg question type is a question type that uses regular expression pattern matching to determine whether a student's response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, behaviours code, regex parsing and error reporting - Oleg Sychev;
# NFA regular expression matching engine - Valeriy Streltsov;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Notations===
Starting from Preg 2.1, the notation feature allows you to choose the notation in which regular expressions for answers will be written. '''Regular expression''' is the default one.

One exciting part of this is that you can use the Preg question type simply as an improved shortanswer, with access to the hinting facility and without any need to understand regular expressions at all! Just choose the '''Moodle shortanswer''' notation and you can copy answers straight from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting section]] below to understand the various settings you can alter to configure your question's hinting behaviour.
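
To give a feeling for what a shortanswer notation implies, here is a rough sketch of how a shortanswer-style answer could be turned into a regular expression. It is only an illustration in Python: the function shortanswer_to_regex is hypothetical, the real conversion inside the question type is not described on this page, and the sketch deliberately ignores escaped stars.

```python
import re

def shortanswer_to_regex(answer):
    """Hypothetical conversion: literal text is escaped, each * becomes .*"""
    return ".*".join(re.escape(part) for part in answer.split("*"))

regex = shortanswer_to_regex("3.14*")

# The dot is now literal, the wildcard accepts any tail.
print(bool(re.fullmatch(regex, "3.14159")))   # True
print(bool(re.fullmatch(regex, "3x14159")))   # False
```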

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by operator '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators is their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minus work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), controlling their order of evaluation using '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use different operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of a regular expression is pattern matching in strings. So its '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - that is, preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches only one character isn't very useful. That is where '''operators''' come in: they allow us to define an expression matching a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match themselves
# ''escaped special characters'' - if you need to use a character with special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches a* (while '''a*''' matches a zero or more times); the backslash is a special character too and should be escaped: '''\\''' matches \
# ''character classes'' - you can specify a number of possible characters for one position in square brackets:
#* '''[ab,!]''' matches a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' - you can specify ranges for letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# the ''dot meta-character'' '''.''' matches any possible character (except newline, but a student couldn't enter one anyway); you should escape the dot, '''\.''', if you need to match a literal dot.

====Operators====
The most commonly used regular expression operators are (could anyone help expand descriptions and examples please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches ab
#* '''a[0-9]''' matches a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. the alternative is repeated, rather than one alternative being selected and then repeated

====Precedence and order of evaluation====
'''Quantifier''' has precedence over '''concatenation''', and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches a followed by zero or more b's
#* changing this using brackets allows us to define repetition of a string: '''(ab)*''' matches ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single-character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches ab or cd or de
#* '''(aa|bb)c''' matches aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' matches aac or bbc or c - a typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (rather than just the last character in it):
#* '''ab|cd*''' matches ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the alternative, not one definite selection from it, i.e.:
#*# '''(a|b){2}''' matches aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only

====Assertions====
''Assertions'' are conditions on some part of the string that does not go into the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'' '''a+(?=b)''' matches with one or more a's followed by b, without including b in the match
* ''negative lookahead assertion'' '''a+(?!b)''' matches with one or more a's that are not followed by b
* ''positive lookbehind assertion'' '''(?<=b)a+''' matches with one or more a's preceded by b
* ''negative lookbehind assertion'' '''(?<!b)a+''' matches with one or more a's that are not preceded by b
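The four kinds of assertions can be tried out as follows. This is a sketch in Python rather than in the PHP engines the question uses; it assumes Python's re module handles these lookaround forms like PCRE does, and matched_text is a hypothetical helper name.

```python
import re

def matched_text(pattern, text):
    """Return the first matched substring, or None (illustrative helper)."""
    found = re.search(pattern, text)
    return found.group(0) if found else None

# Positive lookahead: b must follow, but is not part of the match.
print(matched_text(r"a+(?=b)", "caab"))
# Negative lookahead: the engine gives back one a so that no b follows.
print(matched_text(r"a+(?!b)", "aab"))
# Positive lookbehind: b must precede, but is not part of the match.
print(matched_text(r"(?<=b)a+", "baa"))
# Negative lookbehind: the match cannot start right after b.
print(matched_text(r"(?<!b)a+", "baa"))
```

Note the second and fourth cases: the match is shortened to a single a, because the assertion only has to hold at the edge of the matched text.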

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (the question adds the modifiers for you: '''u''' always and '''i''' in case-insensitive mode). You should also enter one correct response (matching at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question uses all the regular expressions to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one is shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
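The effect of anchoring can be sketched in Python. How the question itself adds the anchors is not described here, so add_exact_anchors below is a hypothetical helper; wrapping the expression in (?: ) is an assumption made so that ^ and $ apply to the whole expression even when it contains top-level alternatives.

```python
import re

def add_exact_anchors(pattern):
    """Hypothetical helper: anchor a pattern at both ends.

    The non-capturing group keeps ^ and $ applied to the whole
    expression rather than to its first and last alternatives only.
    """
    return "^(?:" + pattern + ")$"

response = "they are blue"
print(bool(re.search(r"are blue", response)))   # unanchored: found inside the response
print(bool(re.search(r"^are blue", response)))  # must start at the start of the response
print(bool(re.search(r"are blue$", response)))  # must end at the end of the response
print(bool(re.search(add_exact_anchors(r"are blue"), response)))
```

Without the grouping, '''^ab|cd$''' would read as "ab at the start, or cd at the end", because concatenation has precedence over alternative.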

===Hinting===
Some matching engines support hinting in adaptive mode (not an easy thing to do in PHP at all).

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts to match and then breaks off at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and the student answers
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. With partial matching alone, the student is shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.
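The choice of the hint character can be illustrated with a brute-force sketch in Python. It is not how the matching engines work (they use automata): it assumes the expression has been expanded by hand into the complete list of strings it can match, which is only feasible for small expressions like the one above. The names next_hint_char and ANSWERS are invented for the example.

```python
def next_hint_char(response, complete_answers):
    """Pick the hint character that leads to the shortest completion.

    complete_answers lists every string the expression can match in full,
    so this only works for expressions with a small finite set of matches.
    """
    best = None  # (number of characters still missing, hint character)
    for answer in complete_answers:
        for start in range(len(response)):
            matched = 0
            while (matched < len(answer)
                   and start + matched < len(response)
                   and response[start + matched] == answer[matched]):
                matched += 1
            if matched == 0 or matched == len(answer):
                continue  # no partial match here, or already a full match
            missing = len(answer) - matched
            if best is None or missing < best[0]:
                best = (missing, answer[matched])
    return None if best is None else best[1]

# The two strings matched by: are blue, white(,| and) red
ANSWERS = ["are blue, white, red", "are blue, white and red"]
print(repr(next_hint_char("are blue, white; red", ANSWERS)))
print(repr(next_hint_char("they are blue, vhite and red", ANSWERS)))
```

For the first response the comma wins, because 5 characters are still missing on that path against 8 on the path through the space.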

Not all regular expressions have to give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if the student entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with 50%-100% grades are used and those with 0%-49% grades are not. Regular expressions not used for hinting work only when they have a full match in the student's response.

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all the repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), you should use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening brackets. To be precise, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value etc. That can improve the quality of your feedback. Placeholders don't work in the ''general feedback'', because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then, after entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} is replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax could become complex and prone to errors.
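The placeholder substitution can be imitated in a few lines of Python. This is a sketch under assumptions, not the question type's code: fill_feedback is a hypothetical function, and leaving a placeholder untouched when its number exceeds the number of subpatterns is a choice made for the example.

```python
import re

def fill_feedback(pattern, response, feedback):
    """Replace {$0}, {$1}, ... in feedback with captured text (illustrative)."""
    found = re.search(pattern, response)
    if found is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > found.re.groups:
            return placeholder.group(0)  # no such subpattern: keep the text as is
        # A subpattern that took no part in the match gives an empty string.
        return found.group(index) or ""

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```

The second call shows the limitation mentioned above: the words "Integral part is" stay in the feedback even though nothing follows them.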

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. Several classes of potential errors are reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches with {2,4}). However, such syntax is very prone to errors and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that works when something is absent from the response (and to give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. A regular expression to detect that the word '''necessary''' is missing would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', which allows matching only if the word '''necessary''' does not occur ahead of the current point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
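The role of the '''^''' anchor is easy to see in a small experiment. The sketch below uses Python's re module as a stand-in for the PHP preg engine (an assumption that holds for this simple pattern); word_is_missing is a name invented for the example.

```python
import re

MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")
UNANCHORED = re.compile(r"(?!.*\bnecessary\b.*)")

def word_is_missing(response):
    """True when the word 'necessary' does not occur in the response."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("this step is optional"))
print(word_is_missing("this step is necessary here"))
# Without ^ the expression still matches (at a point after the word starts),
# so it would wrongly report the word as missing.
print(UNANCHORED.search("this step is necessary here") is not None)
```

Because of the '''\b''' word boundaries, a response containing only "unnecessary" still counts as missing the word "necessary".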

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it is still some time away, as it requires further complex work.

===Matching engines===
Matching engines are different pieces of program code that perform matching (either by different methods or written by different people). There is no single 'best' matching engine - it depends on the features you want to use and the regular expressions the engine has to handle. The engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE itself can), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP regular expression operands and operators are supported, and for some (more exotic) ones the behaviour may still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (more will be added):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (more will be added):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that cannot be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
The NFA engine, introduced in the 2.1 release, is a custom matcher that can do basically anything the DFA matcher can, with the addition of subpattern capturing and backreferences.

Now you don't have to choose between hinting and using captured subpatterns in your questions: NFA can do both!

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improved simple assertions support
* Support for complex assertions
* Hinting not one character, but completion of the whole word
* Add automatic generation of the shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.
* Improve Unicode support of custom matching engines

[[Category:Contributed code]]

Discussió:Preg question type

2011-12-13T10:19:56Z

Oasychev: /* Looking for missing things */

=== Looking for missing things ===

''Joseph Rezeau REGEXP question type has a special syntax for missing words feature, allowing to define an answer that would work when something is absent in the answer (and give appropriate feedback to the student).''

Please note that, starting with the version for moodle 2.1, my REGEXP question type accepts negative assertions (such as used in PREG question type) as well as the proprietary '''--.*\bnecessary\b.*''' syntax.

@Oleg: you may wish to change your documentation to reflect this change in REGEXP.

--[[User:Joseph Rézeau|Joseph Rézeau]] 01:41, 21 November 2011 (WST)

Thanks, Joseph, I hope you'll like the changes I've made. Please send me a Moodle message directly if there are similar related changes in the future, as I don't have much time and rarely monitor the documentation wiki talk pages.
--[[User:Oleg Sychev|Oleg Sychev]] 18:19, 13 December 2011 (WST)

Preg question type

2011-12-13T10:16:59Z

Oasychev:

The Preg question type is a question type that uses regular expression pattern matching to determine whether a student's response is correct. It uses the Perl-compatible dialect of regular expressions. For a detailed description of the regular expression syntax, see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, behaviours code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
# NFA regular expression matching engine - Valeriy Streltsov
We gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that were implemented in the Preg question type.

===Notations===
Starting from Preg 2.1, the notation feature allows you to choose the notation in which the regular expressions for answers are written. '''Regular expression''' is the default one.

One exciting part of it is that you can use the Preg question type simply as an improved shortanswer, with access to the hinting facility and without any need to understand regular expressions at all! Just choose the '''Moodle shortanswer''' notation and copy the answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA or DFA engine you get access to hinting. You can skip everything said about regular expressions here, but be sure to read the [[#Hinting|hinting section]] below to understand the various settings you can alter to configure your question's hinting behaviour.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Don't find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''' there, governed by the operators' '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators: their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus has a '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of their order of evaluation using their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of regular expressions is pattern matching in strings. So their '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches with the single character 'A'. There are several ways to define a character set, described below. Special characters used to write operators must be '''escaped''' (preceded by a backslash) when used as operands. Math expressions never had escaping problems, since their operands (numbers, variables) are constructed from different characters than operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches with only one character isn't very useful. So here come '''operators''', which allow us to define expressions matching with strings of several characters.

====Operands====
You can use the following operands in your expressions:
# ''simple characters'' match with themselves
# ''escaped special characters'' if you need to use a character with special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches with a* (while '''a*''' matches with a zero or more times); backslash is a special character too and should be escaped: '''\\''' matches with \
# ''character classes'' you can specify a set of characters possible at one position using square brackets:
#* '''[ab,!]''' matches with a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' you can specify ranges of letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches with - or ] or \; other characters lose their special meaning inside a character class and need not be escaped, but if you want to include ^ in the character class, it should not be first
# ''dot meta-character'' '''.''' matches with any possible character (except newline, but the student can't enter it anyway); you should escape the dot '''\.''' if you need to match a single dot.
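The operands listed above can be tried out one by one. This is a minimal sketch using Python's re module, assumed here to agree with the Perl-compatible dialect for these basic operands; the helper name fits is hypothetical.

```python
import re

def fits(pattern, text):
    """Return True when the whole text fits the pattern (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Escaped special characters
print(fits(r"a\*", "a*"), fits(r"a\*", "aaa"), fits(r"\\", "\\"))
# Character classes, ranges and negative classes
print(fits(r"[ab,!]", ","), fits(r"[a-szC-F0-9]", "D"), fits(r"[a-szC-F0-9]", "t"))
print(fits(r"[^ab]", "c"), fits(r"[^ab]", "a"))
# Escaping inside character classes; ^ is literal when it is not first
print(fits(r"[\-\]\\]", "]"), fits(r"[a^]", "^"))
# The dot meta-character against an escaped dot
print(fits(r".", "x"), fits(r"\.", "x"), fits(r"\.", "."))
```

In the range example, t is rejected because it falls between the range a-s and the single character z.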

====Operators====
The most commonly used regular expression operators (could anyone help expand the descriptions and examples, please?):
# ''concatenation'' - a '''binary''' operator so simple that it has no character of its own. Still, it is an operator and has its own precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches with ab
#* '''a[0-9]''' matches with a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or nothing (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of the empty alternative
# ''quantifiers'' - '''unary''' operators that let you define repetition of the character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. the whole alternative is repeated, rather than one alternative being selected and then repeated
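To see which repetition counts each quantifier accepts, one can filter a list of candidate strings. The sketch below is in Python and assumes its re module reads these quantifiers as described above (Python does read '''{,2}''' as "from 0 to 2 times"); the helper name accepted is made up for the illustration.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that fit the pattern as a whole (illustrative)."""
    return [text for text in candidates if re.fullmatch(pattern, text)]

xs = ["", "x", "xx", "xxx", "xxxx", "xxxxx"]
for pattern in [r"x*", r"x+", r"x?", r"x{2,4}", r"x{2,}", r"x{,2}", r"x{2}"]:
    print(pattern, accepted(pattern, xs))

# The empty alternative makes the bracketed part optional.
print(accepted(r"(aa|bb|)c", ["aac", "bbc", "c", "abc"]))
```

The last line keeps aac, bbc and c, and drops abc, since ab is not one of the alternatives.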

====Precedence and order of evaluation====
'''Quantifier''' has precedence over '''concatenation''' and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches with a followed by zero or more b's
#* brackets change this and allow us to define repetition of a string: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single-character alternatives use character classes rather than alternative operators), but you should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (rather than only the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the whole alternative, not one definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only

====Assertions====
''Assertions'' are conditions on some part of the string that does not go into the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'' '''a+(?=b)''' matches with one or more a's followed by b, without including b in the match
* ''negative lookahead assertion'' '''a+(?!b)''' matches with one or more a's that are not followed by b
* ''positive lookbehind assertion'' '''(?<=b)a+''' matches with one or more a's preceded by b
* ''negative lookbehind assertion'' '''(?<!b)a+''' matches with one or more a's that are not preceded by b

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (the question adds the modifiers for you: '''u''' always and '''i''' in case-insensitive mode). You should also enter one correct response (matching at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question uses all the regular expressions to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one is shortest to complete.
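The grading rule (take the grade of the first expression with a full match, otherwise zero) can be sketched like this. It is only an illustration in Python: grade_response and the answer list are hypothetical, the expressions are assumed to be tried in the order they were entered, and Python's IGNORECASE flag stands in for the '''i''' modifier (Python 3 strings are already Unicode, so nothing is needed for '''u''').

```python
import re

def grade_response(response, answers, ignore_case=False):
    """Return the grade of the first expression that matches, else 0.0.

    answers is a list of (regular expression, grade) pairs, tried in order.
    """
    flags = re.IGNORECASE if ignore_case else 0
    for pattern, grade in answers:
        if re.search(pattern, response, flags):
            return grade
    return 0.0

answers = [
    (r"^are blue, white(,| and) red$", 1.0),  # manually anchored, full credit
    (r"are white(,| and) red", 0.6),          # unanchored, catches a common error
]
print(grade_response("are blue, white, red", answers))
print(grade_response("they are white and red", answers))
print(grade_response("Are Blue, White, Red", answers, ignore_case=True))
```

The second response gets the 60% grade because the unanchored expression only needs to be found somewhere inside it.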

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.

===Hinting===
Some matching engines support hinting in adaptive mode (not an easy thing to do in PHP at all).

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts to match and then breaks off at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and the student answers
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. With partial matching alone, the student is shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.

Not all regular expressions have to give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if the student entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with 50%-100% grades are used and those with 0%-49% grades are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
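The filtering done by the hint grade border amounts to a simple comparison. A sketch in Python, with a made-up function name and grades written as fractions of 1:

```python
def expressions_for_hinting(answers, hint_grade_border):
    """Keep only the expressions whose grade reaches the hint grade border.

    answers is a list of (regular expression, grade) pairs; the structure
    is an assumption made for this example.
    """
    return [pattern for pattern, grade in answers if grade >= hint_grade_border]

answers = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]
print(expressions_for_hinting(answers, 1))
print(expressions_for_hinting(answers, 0.5))
```

With the border set to 1 the 60% expression is left out of hinting, so a student who forgot ''blue'' is never led towards the incomplete answer.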

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all the repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), you should use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening brackets. To be precise, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value etc. That can improve the quality of your feedback. Placeholders don't work in the ''general feedback'', because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote the feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then, after entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} is replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax could become complex and prone to errors.

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. Several classes of potential errors are reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches with {2,4}). However, such syntax is very prone to errors and makes writing regular expressions harder.
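To give an idea of what reporting such errors looks like, here is a sketch that uses Python's regular expression parser as a stand-in. This is an assumption made purely for illustration: Python's parser is neither the DFA engine's parser nor PCRE, and its rules and messages differ in detail, but it also rejects several of the error classes listed above and says where the problem is. The function name describe_error is hypothetical.

```python
import re

def describe_error(pattern):
    """Return a short description of the syntax error, or None if there is none."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"{exc.msg} (position {exc.pos})"
    return None

# Unclosed character class, unclosed and unopened brackets,
# a quantifier without an operand, and one valid expression.
for pattern in ["[ab", "(ab", "ab)", "*a", "ab*"]:
    print(repr(pattern), "->", describe_error(pattern))
```

A message with a position helps the teacher much more than a bare "there is an error" when fixing an expression.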

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a '''missing words''' feature, which allows you to define an answer that works when something is absent from the response (and to give appropriate feedback to the student).

A similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. A regular expression to detect that the word '''necessary''' is missing would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', which allows matching only if the word '''necessary''' does not occur ahead of the current point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''. Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!

Sadly, no engine except PHP_preg_matcher supports complex assertions (i.e. the '''(?!''' part). We are working on it, but it is still some time away, as it requires further complex work.

===Matching engines===
Matching engines are different pieces of program code that do the matching (either by different methods or written by different people). There is no single 'best' matching engine: the choice depends on the features you want to use and on the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE can), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones the support could still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (there would be more):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

====Non-deterministic finite state automata (NFA)====
The NFA engine, introduced in the 2.1 release, is a custom matcher that can basically do anything the DFA matcher can, with the addition of subpattern capturing and backreferences.

Now you don't have to choose between hinting and using captured subpatterns in your questions: NFA can do both!

===Development plans===
There is no definite schedule or order of development for these features: it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Improved simple assertions support
* Support for complex assertions
* Hinting not just one character, but the completion of the whole word
* Add automatic generation of shortest possible correct answer in user-readable form
* Add a set of authoring tools to make writing regular expressions easier
* Develop backtracking matching engine
* Develop more help and examples for the people that don't know much about regular expressions.
* Improve Unicode support of custom matching engines

[[Category:Contributed code]]

Fitxer:Bulk edit4.JPG

2011-05-04T21:07:20Z

Oasychev: uploaded a new version of "File:Bulk edit4.JPG"

Bulk module editing step 4: edit resulting form

Linked activities

2011-05-04T20:31:54Z

Oasychev: /* Interface */

In large courses we often find sets of similar activities (quizzes, assignments and so on) that share many settings (i.e. module settings), though of course not all of them. Maintaining such sets is quite a pain: imagine yourself adding a new IP range to 6-10 quizzes. So we need a usable and robust way to handle such sets: linked activities.

This page is to discuss and find a better way to do so.

Let's call a '''setting''' some parameter of a course module instance. A setting isn't necessarily one control on the form (it may be a group of related controls, check boxes for example). A setting may or may not be a field in the db; this needs thinking.

A '''set''' is a group of course module instances, that share common settings.

== Issues ==
* linked activities may need to respond to the editing (and deleting) of one of them, so they need events on these occasions. Anyone interested, please vote for MDL-16203.
* modules now handle settings (from form to db and from db to form) in bulk; there is no way to tell the module "please save in the db (or show on the form) this setting and leave the others be". It's possible to do this without the module's help, but it would help if a module could map individual settings to db/form (controls).

== Interface ==

=== Level 0: Save as another ===
Done in Moodle 2.1 as copy activity.

=== Level 1: Bulk activities editing ===
This doesn't require a DB change, and hopefully can be done in 1.9 too.
The bulk_module_edit block is intended to be placed on a course page.

==== Step 1: Select a module type to edit ====
[[File:Bulk_edit1.JPG]]

==== Step 2: Select instances to edit ====
[[File:Bulk_edit2.JPG]]

==== Step 3: Select fields to edit ====
[[File:Bulk_edit3.JPG]]
Sorry, the quiz has quite a long list of form fields, which doesn't fit in the screenshot.

==== Step 4: Edit and save form ====
[[File:Bulk_edit4.JPG]]

=== Level 2: Sets of activities ===
On this level the system would be able to store sets of related instances and provide a one-click link to edit them. This will require new db tables, so it is probably 2.0 only.

There must be a block that displays the sets as links to edit them, and new controls on the index.php page to create (manage?) sets.

If events were fired on instance update, then it would be possible to have sets that automatically update some settings whenever one of the activities in the set is updated.

== Architecture ==

Fitxer:Bulk edit4.JPG

2011-05-04T20:30:58Z

Oasychev: Bulk module editing step 4: edit resulting form

Bulk module editing step 4: edit resulting form

Fitxer:Bulk edit3.JPG

2011-05-04T20:30:36Z

Oasychev: Bulk module editing step 3: selecting fields to bulk edit

Bulk module editing step 3: selecting fields to bulk edit

Fitxer:Bulk edit2.JPG

2011-05-04T20:29:09Z

Oasychev: Bulk module editing step 2: selecting instances

Bulk module editing step 2: selecting instances

Fitxer:Bulk edit1.JPG

2011-05-04T20:26:32Z

Oasychev: Bulk module editing block step 1: selecting module

Bulk module editing block step 1: selecting module

Broken/Process

2010-12-09T14:51:44Z

Oasychev: /* If the issue is a bug in the current stable version requiring database changes, assign "Fix version" to DEVBACKLOG */ new section

I think 'User' should be described as both 'finding issues' and 'suggesting improvements' for Moodle, even if the detail of how things are implemented are left for the Product Owner. Just thinking that a good chunk of the things I add to the Tracker are suggestions for improvement rather than 'issues' as such and that we should promote this idea as a general rule - maybe :)

:I agree with Mark. Bug tracker should be used for actual bugs. Discussion about future roadmaps, enhancements and improvements is a separate matter, and in actual fact it is hard to engage in this. I don't have an answer at the moment. Maybe a place to discuss, and notification of when discussions have heated up (like specs are being prepared and a new roadmap developed for a component) i.e. owners manage the discussion, but there is a clear place to go to engage, some indication of timelines etc. --[[User:Derek Chirnside|Derek Chirnside]] 19:39, 25 November 2010 (UTC)

Well, "issues" there was meant to cover "bugs" and "suggestions", as the tracker does both. I'll make it more explicit though. [[User:Martin Dougiamas|Martin Dougiamas]] 10:12, 30 November 2010 (UTC)

OK - I could live with this. But I still prefer this clearer distinction as suggested by Mark:

===User role===

# Uses Moodle
# Finds issues/bugs (report in tracker)
# Suggests improvements (report in tracker)

Question: is there a way in the tracker to keep issues in two lists: bugs and suggestions? Maybe even like Google: you can suggest anything, but at any given time there are a few suggestions up for vote? In Moodle, you can discuss anything, but someone is highlighting at any given time a few topics for focused discussions. On reflection the answer may be No - on this basis I am back to my original suggestion.

A bug is a bug - it is not working as we know it should. A suggestion for an enhancement is partly an invitation to dialogue, vote. Voting for bugs is silly - all bugs need fixing, and priorities are best (IMO) determined centrally. So:

#Uses Moodle
#Finds and reports bugs (use tracker)
#Suggests improvements (use tracker)
#Takes part in dialogue around suggested improvements

[[User:Derek Chirnside|Derek Chirnside]] 06:18, 4 December 2010 (UTC)

:Derek, thanks for your comments. Regarding a way in the tracker to keep issues in two lists: bugs and suggestions, when you create an issue you have a choice of issue types - bug, new feature, task and improvement. Thus, a search for all new features and improvements should generate a list of suggestions. --[[User:Helen Foster|Helen Foster]] 14:31, 4 December 2010 (UTC)

== Backlog naming ==

Regarding the latest change about naming backlog versions with "STABLEBACKLOG/DEVBACKLOG", I'd recommend to use instead something like: "1.9.x backlog/2.0.x backlog/2.1.x backlog" because:

# It saves us from moving things when a new major release happens (so it won't be necessary to move all the DEVBACKLOG => STABLEBACKLOG).
# It supports '''multiple''' stable branches, like we have now (1.9.x, 2.0.x...)
# It respects the format used by both the Affected Branches and Fixed Branches custom fields that are really useful for a lot of filters.

Ciao, [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 23:33, 6 December 2010 (UTC) :-)

:Addenda: Finally it has been decided to go with 2 backlogs only (stable/dev). Fair enough, so the developer (team) will look at the real branches where the solution needs to be implemented. [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 10:23, 9 December 2010 (UTC)

==This does not look like scrum to me at all==

I think we need a certified scrum master. This proposal IMHO seems to break nearly all the good Scrum practises described in books.

Ciao, [[User:Petr Škoda (škoďák)|Petr Škoda (škoďák)]] 10:03, 9 December 2010 (UTC)

:Agree! Some points (see point 3 especially, in both the STABLE and DEV teams) break the thing. It's the (scrum, by team) master's responsibility to discuss with the product owner, not the team itself! Isolation is a MUST.

: Ciao, [[User:Eloy Lafuente (stronk7)|Eloy Lafuente (stronk7)]] 10:12, 9 December 2010 (UTC)

----

== If the issue is a bug in the current stable version requiring database changes, assign "Fix version" to DEVBACKLOG ==

Hmm, maybe this should be governed by bug severity too? Surely you aren't going to say that every bug requiring a DB update, however severe it is, should be left to the next major release? --[[User:Oleg Sychev|Oleg Sychev]] 14:51, 9 December 2010 (UTC)

VSTU projects

2010-11-13T18:51:26Z

Oasychev: /* Blocks */

==Lead: Sychev Oleg Aleksandrovich==
Senior Lecturer of POAS (Software Engineering) department

===Core patches===
# [[Development:Categories editing interface improvment|Improvement of the editing interface for the category list]]. Moving a category anywhere should require only one page reload. Developers: Sychev Oleg, Shkarupa Alex. Status: waiting for review!
# [[Development:Javascript-interface for repeat_elements function|Javascript-interface for repeat_elements function]] - adding new blanks to the form shouldn't require page reloads.
# [[Development:Forum thread subscription|Forum thread subscription]] - a long-standing request to be able to subscribe only to a thread, not the whole forum.

===Activity modules===
# [[Development:Assignment development|New assignment-replacement module]] with support for individual tasks (with custom fields), several grading criteria, automatic graders etc. Developers: Sychev Oleg, Erofeev Anatolius.
# [[Development:Subcourse module improvments|Subcourse module improvements]] - improve the Subcourse module to get rid of buggy Metacourses and make it actually useful.

===Question types===
# [[Preg question type]] - a regular expression question type, developed on a more solid algorithmic basis than the regex one. Developers: Sychev Oleg, Kolesov Dmitry.
# [[Development:Auto-feedback shortanswer question|Auto-feedback shortanswer question]] - a shortanswer question that would be able to detect typos and misplaced, missing or extra words and give appropriate feedback.

===Blocks===
# [[Development:Linked activities|Linked activities]] to allow editing properties of multiple activities at once.
# [[Auto role assignment block]] is useful to temporarily restrict (or enhance) user abilities while doing something.
# [[Supervised block]] is much better than IP filter+time control if you want students to be able to do something only under teacher supervision.

===Libraries===
# [[Development:Auto-backup library|Auto-backup library]] - by adding a very small amount of information to install.xml files it is possible to automate many tedious tasks of programming backup/restore for plugins.

Preg question type

2010-10-25T11:41:21Z

Oasychev: /* Deterministic finite state automata (DFA) */

Preg question type is a question type that uses regular expression pattern matching to find out whether a student response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see [[#Development plans|development plans]] section) - there is still more to be done than we have time.

For now the new code with all this functionality is located in the HEAD branch of the preg question type (you can also download it from the [http://moodle.org/mod/data/view.php?d=13&rid=1901&filter=1 Modules and Plugins database] using the 'latest version' link). It works with Moodle 2.0. The code is considered BETA quality - use it with care! If you find any bugs, please report them on the tracker.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''' there, governed by operator '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators is their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of their order of evaluation using their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of regular expressions is pattern matching in strings. So their '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands, i.e. preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than their operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''', which allow us to define an expression matching a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match themselves
# ''escaped special characters'': if you need to use a character with a special meaning (like |, * or a bracket) just as a usual character to match, you should precede it with a backslash: '''a\*''' matches a* (while '''a*''' matches a zero or more times); the backslash is a special character too and should be escaped: '''\\''' matches \
# ''character classes'': you can specify a number of possible characters for one place in square brackets:
#* '''[ab,!]''' matches a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' - you can specify ranges of letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# ''dot meta-character'': '''.''' matches any possible character (except newline, but the student can't enter it anyway); you should escape the dot, '''\.''', if you need to match a single dot.
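
The operand rules above can be tried out quickly. The following is a minimal sketch that uses Python's standard re module as a stand-in for the PCRE dialect (an assumption: the question type itself runs on PHP, and the helper name matches is made up for this illustration). It checks whether a whole string matches each operand.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"a\*", "a*"))          # escaped star is a literal character
print(matches(r"a*", "aaa"))          # unescaped star is a quantifier
print(matches(r"\\", "\\"))           # escaped backslash matches one backslash
print(matches(r"[ab,!]", ","))        # character class with single characters
print(matches(r"[a-szC-F0-9]", "t"))  # 't' is outside a-s, z, C-F and 0-9
print(matches(r"[^ab]", "a"))         # negative class excludes a and b
print(matches(r"[\-\]\\]", "]"))      # escaping inside a character class
print(matches(r".", "x"))             # dot matches any character
print(matches(r"\.", "x"))            # escaped dot matches only a dot
```

The fifth, sixth and last calls print False; all the others print True.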

====Operators====
The most commonly used regular expression operators (could anyone help expand the descriptions and examples, please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches ab
#* '''a[0-9]''' matches a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of the character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. it is the alternative that is repeated, rather than one alternative being selected and then repeated
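
As a quick check of the operators listed above, here is a small sketch using Python's re module in place of the question's own engines (an assumption made only for illustration; the helper name matches is hypothetical). It tests whole-string matches for concatenation, alternatives and quantifiers.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Concatenation: operands written in sequence.
print(matches(r"a[0-9]", "a7"))       # True

# Alternative, including the empty alternative.
print(matches(r"ab|cd|de", "cd"))     # True
print(matches(r"(aa|bb|)c", "c"))     # True: the empty alternative was chosen
print(matches(r"(aa|bb)c", "c"))      # False: no empty alternative here

# Quantifiers.
print(matches(r"x{2,4}", "xxx"))      # True
print(matches(r"x{2,4}", "xxxxx"))    # False: five repetitions is too many
print(matches(r"x?", ""))             # True: zero repetitions allowed
print(matches(r"(ab)*", "ababab"))    # True: brackets make ab the operand
print(matches(r"(a|b){2}", "ba"))     # True: each repetition chooses again
```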

====Precedence and order of evaluation====
'''Quantifiers''' have precedence over '''concatenation''', and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches a followed by zero or more b's
#* changing this using brackets allows us to define a string repetition: '''(ab)*''' matches ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single-character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches ab or cd or de
#* '''(aa|bb)c''' matches aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' matches aac or bbc or c - a typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not the last character in it):
#* '''ab|cd*''' matches ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the alternative, not one definite selection from it, i.e.:
#*# '''(a|b){2}''' matches aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only
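
The precedence rules above are easiest to see by comparing pairs of expressions that differ only in brackets. This is a minimal sketch with Python's re module, assumed here to follow the same precedence as PCRE for these operators; the helper name matches is made up for the example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: * applies to b only unless brackets are used.
print(matches(r"ab*", "abbb"))        # True
print(matches(r"ab*", "abab"))        # False
print(matches(r"(ab)*", "abab"))      # True

# Concatenation over alternative: the alternatives are ab and cd*, not b|c.
print(matches(r"ab|cd*", "cddd"))     # True
print(matches(r"ab|cd*", "abd"))      # False
print(matches(r"(ab|cd)*", "abcdcd")) # True

# A quantifier repeats the alternative, not one fixed choice from it.
print(matches(r"(a|b){2}", "ab"))     # True
print(matches(r"a{2}|b{2}", "ab"))    # False
print(matches(r"a{2}|b{2}", "bb"))    # True
```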

====Assertions====
''Assertions'' are conditions on some part of the string that doesn't actually become part of the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'': '''a+(?=b)''' matches one or more a's followed by b, without including b in the match
* ''negative lookahead assertion'': '''a+(?!b)''' matches one or more a's not followed by b
* ''positive lookbehind assertion'': '''(?<=b)a+''' matches one or more a's preceded by b
* ''negative lookbehind assertion'': '''(?<!b)a+''' matches one or more a's not preceded by b
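
To show that the asserted character stays outside the match, here is a short sketch using Python's re module, whose lookahead and lookbehind syntax is assumed to behave like PCRE's for these simple cases. The helper name first_match is hypothetical.

```python
import re

def first_match(pattern: str, text: str):
    """Return the text of the first match, or None when nothing matches."""
    found = re.search(pattern, text)
    return found.group() if found else None

print(first_match(r"a+(?=b)", "xaab"))   # aa   (b is required but not captured)
print(first_match(r"a+(?!b)", "aac"))    # aa   (the a's are followed by c, not b)
print(first_match(r"(?<=b)a+", "baa"))   # aa   (b precedes the match)
print(first_match(r"(?<!b)a+", "ba"))    # None (the only a is preceded by b)
print(first_match(r"(?<!b)a+", "caa"))   # aa   (the a's are preceded by c)
```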

====Matching====
'''Matching''' means finding a part of the student answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (modifiers will be added for you by the question: '''u''' is always added, and '''i''' is added in case-insensitive mode). You should also enter one correct response (one that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question will use all regular expressions in order to find the first full match (full for the expression, but not necessarily for the whole response - see [[#Anchoring|anchoring]]) and take the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one would be shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question will add ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
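
The difference between anchored and unanchored matching can be sketched as follows, again with Python's re module standing in for the real engines. The helper exact is hypothetical: it wraps the expression in (?: ) before adding ^ and $ so that a top-level alternative stays intact, which is a precaution of this sketch and not necessarily what the question does internally.

```python
import re

def exact(pattern: str) -> str:
    """Anchor a pattern at both ends (hypothetical helper, see note above)."""
    return "^(?:" + pattern + ")$"

def found(pattern: str, response: str) -> bool:
    """Return True when the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

PATTERN = r"are blue, white(,| and) red"
RESPONSE = "they are blue, white and red"

print(found(PATTERN, RESPONSE))                       # True: match may start anywhere
print(found("^" + PATTERN, RESPONSE))                 # False: response starts with "they"
print(found(exact(PATTERN), RESPONSE))                # False: extra text before the match
print(found(exact(PATTERN), "are blue, white, red"))  # True: the whole response matches
```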

===Hinting===
Some matching engines support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts to match and then breaks at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student will have a '''hint''' button; by pressing it he receives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.
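
The choice of the hint character can be illustrated with a toy sketch. It is not how the matching engines work (they operate on automata); it assumes instead that the strings accepted by '''are blue, white(,| and) red''' are simply listed by hand, and all names in it (ACCEPTED, common_prefix_len, best_hint) are hypothetical.

```python
# The two strings accepted by: are blue, white(,| and) red
ACCEPTED = ["are blue, white, red", "are blue, white and red"]

def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count

def best_hint(accepted, response):
    """Return (matched part, next correct character) for the partial match
    that needs the fewest characters to complete, or None if there is none."""
    best = None
    for start in range(len(response)):        # unanchored: try every start
        tail = response[start:]
        for answer in accepted:
            k = common_prefix_len(answer, tail)
            if k == 0 or k == len(answer):
                continue                       # no partial match, or a full one
            remaining = len(answer) - k
            # Prefer fewer remaining characters, then a longer matched part.
            candidate = (remaining, -k, answer[:k], answer[k])
            if best is None or candidate < best:
                best = candidate
    return None if best is None else (best[2], best[3])

print(best_hint(ACCEPTED, "they are blue, vhite and red"))  # ('are blue, ', 'w')
print(best_hint(ACCEPTED, "are blue, white; red"))          # ('are blue, white', ',')
```

In the second call the candidate ',' needs 5 more characters (", red") while ' ' needs 8 (" and red"), which is the difference of 3 characters mentioned above.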

It is possible that not all regular expressions will give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0,5, regular expressions with 50%-100% grades are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student response.
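
The effect of the hint grade border amounts to a simple filter over the answers. The sketch below assumes answers are stored as (regular expression, grade) pairs with grades as fractions from 0 to 1; this data layout and the function name usable_for_hinting are illustrative assumptions, not the question's actual code.

```python
# (regular expression, grade) pairs from the example above
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def usable_for_hinting(answers, hint_grade_border):
    """Keep only expressions whose grade reaches the hint grade border."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(usable_for_hinting(ANSWERS, 1.0))  # only the 100% expression
print(usable_for_hinting(ANSWERS, 0.5))  # both expressions
```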

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the value of the first subpattern etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers could have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote this feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then, on entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax could become complex and error-prone.
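
The placeholder substitution described above can be sketched in a few lines. This is only an illustration written with Python's re module, not the question type's PHP code; the function name fill_feedback is hypothetical, and leaving unknown placeholders untouched is a choice of this sketch.

```python
import re

def fill_feedback(pattern: str, response: str, feedback: str):
    """Replace {$N} placeholders in the feedback with captured subpatterns.
    Returns None when the response doesn't match the pattern."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)    # unknown subpattern: leave as is
        return match.group(index) or ""    # unmatched subpattern: empty string

    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

print(fill_feedback(NUMBER, "123.34", FEEDBACK))
# The number is: 123.34 Integral part is 123 and fractional part is 34
print(fill_feedback(NUMBER, ".5", FEEDBACK))
# The number is: .5 Integral part is  and fractional part is 5
```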

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. There are several classes of potential errors reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of them as '''non-errors''', which makes the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a special syntax for the '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and gives appropriate feedback to the student). So if you want to look out for the missing word '''necessary''' in the response, you add this answer (WARNING - REGEXP-only syntax on the next line):
--.*\bnecessary\b.*
where \b defines a word boundary, while .* ensures that this word could be anywhere in the response.

There is no need for such a feature in the PREG question type, since a similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The equivalent regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows a match only if the word '''necessary''' does not occur anywhere ahead of the current point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where a match is possible even though the word is present).

If this description is hard to follow, just surround the regexp that should be missing with '''^(?!''' and ''')'''.
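
The missing-word expression can be checked with a small sketch. It uses Python's re module, assumed to treat this anchored negative lookahead the same way as PCRE; the function name word_is_missing is made up for the example.

```python
import re

# Matches (with an empty match at the start) only when the word
# "necessary" does not occur anywhere in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")

def word_is_missing(response: str) -> bool:
    """Return True when the response lacks the word 'necessary'."""
    return MISSING_NECESSARY.search(response) is not None

print(word_is_missing("it is a sufficient condition"))  # True
print(word_is_missing("it is a necessary condition"))   # False
print(word_is_missing("it is unnecessary"))             # True: \b demands a whole word
```

Without the leading ^ the expression would match the second response as well, because the lookahead succeeds at any position after the word.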

===Matching engines===
Matching engines are different pieces of program code that do the matching (either by different methods or written by different people). There is no single 'best' matching engine: the choice depends on the features you want to use and on the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE can), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta quality for now. Not all operands and operators available in PHP regular expressions are supported, and for some (more exotic) ones the behaviour could still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (locale-aware, but not Unicode for performance reasons, as in standard regular expression functions)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (there would be more):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Update the DFA matching engine to support all operators that the DFA algorithm can handle
* Improve Unicode support of custom matching engines
* Add automatic generation of the shortest possible correct answer in user-readable form
* Add generation of a 'description' for a regular expression to facilitate its editing
* Develop NFA and backtracking matching engines
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2010-10-25T11:30:53Z

Oasychev: /* Development plans */

The Preg question type is a question type that uses regular expression pattern matching to determine whether a student's response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see [[#Development plans|development plans]] section) - there is still more to be done than we have time.

For now the new code with all this functionality is located in the HEAD branch of the preg question type (you can also download it from the [http://moodle.org/mod/data/view.php?d=13&rid=1901&filter=1 Modules and Plugins database] using the 'latest version' link). It works with Moodle 2.0. The code is considered BETA quality - use it with care! If you find any bugs, please report them on the tracker.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all mastered arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''' there, governed by the operators' '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators: their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of their order of evaluation using their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of regular expressions is pattern matching in strings. So their '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches with the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - that is, preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than their operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches with only one character isn't very useful. So here come '''operators''', which allow us to define an expression matching with a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match with themselves
# ''escaped special characters'' - if you need to use a character with a special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches with a* (while '''a*''' matches with a zero or more times); backslash is a special character too and should be escaped: '''\\''' matches with \
# ''character classes'' - you can specify a number of possible characters for one position in square brackets:
#* '''[ab,!]''' matches with a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' - you can specify ranges of letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches with - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# ''dot meta-character'' - '''.''' matches with any possible character (except newline, but the student can't enter one anyway); you should escape the dot ('''\.''') if you need to match a single dot.
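
The operands above can be tried out with a few lines of Python. This is only a sketch using Python's standard '''re''' module, whose syntax agrees with PCRE for these simple cases; the helper name '''matches_fully''' is hypothetical:

```python
import re

def matches_fully(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# Escaped special character versus quantifier.
print(matches_fully(r"a\*", "a*"))           # literal a followed by literal *
print(matches_fully(r"a*", "aaa"))           # a repeated
# Character class with ranges and single characters.
print(matches_fully(r"[a-szC-F0-9]", "z"))
print(matches_fully(r"[a-szC-F0-9]", "t"))   # t is outside a-s and is not z
# Negative class, escaping inside a class, and the escaped dot.
print(matches_fully(r"[^ab]", "c"))
print(matches_fully(r"[\-\]\\]", "]"))
print(matches_fully(r"\.", "x"))             # escaped dot matches only a dot
```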

====Operators====
The most commonly used regular expression operators (could anyone help expand descriptions and examples please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches with ab
#* '''a[0-9]''' matches with a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. it is the alternative that is repeated, rather than one alternative being selected and then repeated
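
A quick way to see which strings an operator accepts is to run a list of candidates through the expression. The sketch below does this with Python's standard '''re''' module (an assumption for illustration; the helper '''accepted''' is hypothetical and not part of the question type):

```python
import re

def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Alternative with an empty branch.
print(accepted(r"(aa|bb|)c", ["aac", "bbc", "c", "abc"]))
# Quantifier with lower and upper bounds.
print(accepted(r"x{2,4}", ["x", "xx", "xxxx", "xxxxx"]))
# Quantifier applied to a bracketed group.
print(accepted(r"(ab)*", ["", "ab", "abab", "aba"]))
# The alternative itself is repeated, so mixed pairs are accepted too.
print(accepted(r"(a|b){2}", ["aa", "ab", "ba", "bb", "a"]))
```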

====Precedence and order of evaluation====
'''Quantifiers''' have precedence over '''concatenation''', and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches with a followed by zero or more b's
#* changing this using brackets allows us to define a string repetition: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single-character alternatives use character classes, not alternative operators), but you should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not just the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the alternative, not a definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only
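
The precedence rules can be compared side by side by checking the same sample strings against an expression with and without brackets. This is a sketch with Python's standard '''re''' module; the sample list and the '''language''' helper are hypothetical:

```python
import re

SAMPLES = ["a", "abbb", "abab", "ab", "c", "cddd", "abd", "aa", "bb", "ba"]

def language(pattern):
    """Samples that the pattern matches in full."""
    return [s for s in SAMPLES if re.fullmatch(pattern, s)]

# Quantifier binds tighter than concatenation.
print("ab*      ", language(r"ab*"))
print("(ab)*    ", language(r"(ab)*"))
# Quantifier binds tighter than alternative.
print("ab|cd*   ", language(r"ab|cd*"))
print("(ab|cd)* ", language(r"(ab|cd)*"))
# Repeating the alternative versus repeating each branch.
print("(a|b){2} ", language(r"(a|b){2}"))
print("a{2}|b{2}", language(r"a{2}|b{2}"))
```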

====Assertions====
''Assertions'' are statements about some part of the string that doesn't actually go into the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'' '''a+(?=b)''' matches with one or more a followed by b, without including b in the match
* ''negative lookahead assertion'' '''a+(?!b)''' matches with one or more a not followed by b
* ''positive lookbehind assertion'' '''(?<=b)a+''' matches with one or more a preceded by b
* ''negative lookbehind assertion'' '''(?<!b)a+''' matches with one or more a not preceded by b
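
The four assertions can be observed with Python's standard '''re''' module, which uses the same lookaround syntax. This is a sketch; '''first_match''' is a hypothetical helper, and note how the asserted character never appears in the returned match:

```python
import re

def first_match(pattern, text):
    """Text of the first match, or None when there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"a+(?=b)", "caab"))     # aa : b is required but not consumed
print(first_match(r"a+(?!b)", "aab"))      # a  : backs off so that b does not follow
print(first_match(r"(?<=b)a+", "caabaa"))  # aa : only the run right after b
print(first_match(r"(?<!b)a+", "baa"))     # a  : the a right after b is skipped
```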

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (modifiers are added for you by the question - '''u''' is always added, and '''i''' in case-insensitive mode). You should also enter one correct response (one that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question goes through all regular expressions in order to find the first full match (full for the expression, but not necessarily the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one would be shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
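
The difference between anchored and unanchored matching can be sketched in Python with the standard '''re''' module. The function '''response_matches''' and its '''exact''' flag are hypothetical stand-ins for the exact matching option, and wrapping the expression in (?: ) before adding the anchors is an assumption of this sketch, not a description of the question type's code:

```python
import re

def response_matches(pattern, response, exact):
    """Return True on a match; exact=True mimics adding ^ and $."""
    if exact:
        # Assumption: group the expression so the anchors apply to all of it.
        pattern = "^(?:" + pattern + ")$"
    return re.search(pattern, response) is not None

REGEX = "are blue, white(,| and) red"
print(response_matches(REGEX, "they are blue, white and red", exact=False))
print(response_matches(REGEX, "they are blue, white and red", exact=True))
print(response_matches(REGEX, "are blue, white, red", exact=True))
```

With exact matching off, the extra word "they" is tolerated; with it on, only a response consisting of the match alone is accepted.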

===Hinting===
Some matching engines support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts to match and then breaks at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and the student answers
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student will have a '''hint''' button; by pressing it he receives a hint with the next correct character, highlighted by background colouring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether he wants to add the hinted character to his response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.
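
The choice of the hint character can be imitated with a brute-force sketch in Python. It does not use a matching engine at all: as a simplifying assumption, the two strings accepted by the example expression are written out by hand in '''TARGETS''', and the hypothetical function '''best_hint''' looks for the partial match that needs the fewest extra characters to become a full match:

```python
# Hand-written list of every string the example expression accepts.
TARGETS = ["are blue, white, red", "are blue, white and red"]

def best_hint(targets, response):
    """Return (partial_match, next_character) for the partial match that
    needs the fewest extra characters to become a full match."""
    best = None
    for target in targets:
        for start in range(len(response)):
            n = 0
            while (n < len(target) and start + n < len(response)
                   and response[start + n] == target[n]):
                n += 1
            if n == 0 or n == len(target):
                continue  # nothing matched here, or already a full match
            remaining = len(target) - n
            if best is None or remaining < best[0]:
                best = (remaining, target[:n], target[n])
    return None if best is None else (best[1], best[2])

print(best_hint(TARGETS, "they are blue, vhite and red"))
print(best_hint(TARGETS, "are blue, white; red"))
```

For the second response both ',' and ' ' would continue the match, and the sketch picks ',' because the remaining path is shorter, as described above.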

Not all regular expressions have to give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
The '''hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0,5, regular expressions with grades of 50%-100% are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
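
The rule amounts to a simple filter over the answers. A minimal Python sketch, assuming answers are stored as (regular expression, grade) pairs with grades as fractions (the data layout and the name '''hinting_candidates''' are hypothetical):

```python
# Each answer is a (regular expression, grade) pair; grades are fractions.
ANSWERS = [
    ("are blue, white(,| and) red", 1.0),
    ("are white(,| and) red", 0.6),
]

def hinting_candidates(answers, hint_grade_border):
    """Regular expressions allowed to take part in partial matching and hinting."""
    return [regex for regex, grade in answers if grade >= hint_grade_border]

print(hinting_candidates(ANSWERS, 1.0))   # only the 100% expression
print(hinting_candidates(ANSWERS, 0.5))   # both expressions
```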

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match, but also the parts of it corresponding to each subpattern. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last of the repeated matches is captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'', because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do this, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote this feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then, on entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax could become complex and prone to errors.
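
The placeholder substitution can be imitated in a few lines of Python. This is only a sketch of the idea using Python's standard '''re''' module, not the question type's PHP code; '''fill_placeholders''' is a hypothetical name, and leaving unknown placeholders untouched is an assumption of the sketch:

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

def fill_placeholders(feedback, match):
    """Replace {$n} with the text captured by subpattern n (empty if unset)."""
    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            return placeholder.group(0)  # unknown subpattern: leave as is
        return match.group(index) or ""
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)

print(fill_placeholders(FEEDBACK, NUMBER.search("123.34")))
print(fill_placeholders(FEEDBACK, NUMBER.search(".5")))
```

The second call shows the empty integral part: the words "Integral part is" remain, followed by nothing.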

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. There are several classes of potential errors reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches with {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, and that is outside the current scope of development. There are more pressing issues ahead.
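
As an illustration of what parser-level error reporting looks like, the sketch below asks Python's standard '''re''' module (a different parser from both PCRE and the DFA engine's own, used here only as a stand-in) to describe what is wrong with a few malformed expressions. The helper name '''describe_error''' is hypothetical:

```python
import re

def describe_error(pattern):
    """Return None for a valid pattern, otherwise the parser's message."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# Unclosed class, unclosed bracket, unopened bracket, quantifier with
# nothing to repeat, and one valid expression for comparison.
for pattern in ["[ab", "(ab", "ab)", "*a", "a(b|c)"]:
    print(repr(pattern), "->", describe_error(pattern))
```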

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a special syntax for its '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and gives appropriate feedback to the student). So if you want to look out for the missing word '''necessary''' in the response, you would add this answer (WARNING - REGEXP-only syntax on the next line):
--.*\bnecessary\b.*
where \b defines a word boundary, while .* ensures that this word could be anywhere in the response.

There is no need for such a feature in the PREG question type, since a similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The equivalent regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''', which allows a match only if the word '''necessary''' does not occur anywhere ahead of the current point in the string;
* '''^''' is an assertion too, which anchors the match to the start of the response (otherwise there would be places in the response after the start of the word "necessary" where a match is possible even though the word is present).

If the description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''.

===Matching engines===
Matching engines are different pieces of program code that perform matching (either by different methods or written by different people). There is no single 'best' matching engine - the choice depends on the features you want to use and the regular expressions the engine should handle. The engines differ in stability and in the features they offer.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE can), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' with this engine. However, it does support subpattern capturing. Choose it when you need complex regexp features that other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta quality for now. Not all operands and operators available in PHP regular expressions are supported, and for some (more exotic) ones the behaviour could still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (WARNING: they probably will not work for Unicode (non-latin) characters for now)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (there would be more):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences
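
To illustrate why a DFA engine lends itself to hinting, here is a minimal Python sketch of a hand-built automaton for the expression '''ab*c'''. The transition table, the function name '''run_dfa''' and the choice to match the whole response from its start are assumptions made for this example; the real engine builds its automaton from the parsed expression:

```python
# Hand-built DFA for the expression ab*c; state 2 is the accepting state.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 1, "c": 2},
    2: {},
}
ACCEPTING = {2}

def run_dfa(response):
    """Return (matched_length, accepted, possible_next_characters)."""
    state = 0
    matched = 0
    for ch in response:
        if ch not in TRANSITIONS[state]:
            break  # the response stops matching at this character
        state = TRANSITIONS[state][ch]
        matched += 1
    accepted = state in ACCEPTING and matched == len(response)
    return matched, accepted, sorted(TRANSITIONS[state])

print(run_dfa("abbc"))  # full match
print(run_dfa("abx"))   # partial match of 2 characters, b or c could follow
```

Because the automaton always knows its current state, the characters that could continue the match are simply the keys of that state's transitions, which is what a hint needs. The table also shows why capturing is out of reach: a state records where the match is, not which bracket consumed which characters.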

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Update the DFA matching engine to support all operators that the DFA algorithm can handle
* Improve Unicode support of custom matching engines
* Add automatic generation of the shortest possible correct answer in user-readable form
* Add generation of a 'description' for a regular expression to facilitate its editing
* Develop NFA and backtracking matching engines
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2010-10-25T11:30:12Z

Oasychev: Adding description how to look for missing words etc.

The Preg question type is a question type that uses regular expression pattern matching to determine whether a student's response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see [[#Development plans|development plans]] section) - there is still more to be done than we have time.

For now the new code with all this functionality is located in the HEAD branch of the preg question type (you can also download it from the [http://moodle.org/mod/data/view.php?d=13&rid=1901&filter=1 Modules and Plugins database] using the 'latest version' link). It works with Moodle 2.0. The code is considered BETA quality - use it with care! If you find any bugs, please report them on the tracker.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all mastered arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''' there, governed by the operators' '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators: their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set up a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of their order of evaluation using their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.
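
Precedence and arity can be made visible by asking a parser which operator sits at the top of the expression, i.e. which one is evaluated last. The sketch below uses Python's standard '''ast''' module on the arithmetic examples above; the helper names '''top_operator''' and '''arity''' are hypothetical:

```python
import ast

def top_operator(expression):
    """Name of the operator evaluated last, i.e. the root of the parse tree."""
    node = ast.parse(expression, mode="eval").body
    return type(node.op).__name__

def arity(expression):
    """Number of operands taken by the top-level operator."""
    node = ast.parse(expression, mode="eval").body
    return 1 if isinstance(node, ast.UnaryOp) else 2

print(top_operator("x+y*2"), arity("x+y*2"))      # Add 2
print(top_operator("(x+y)*2"), arity("(x+y)*2"))  # Mult 2
print(top_operator("-x"), arity("-x"))            # USub 1
```

In '''x+y*2''' the multiplication is nested inside the addition, so it is evaluated first; the brackets in '''(x+y)*2''' swap the two.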

===Regular expressions===
The goal of regular expressions is pattern matching in strings. So their '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches with the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - that is, preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than their operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches with only one character isn't very useful. So here come '''operators''', which allow us to define an expression matching with a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match with themselves
# ''escaped special characters'' - if you need to use a character with a special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches with a* (while '''a*''' matches with a zero or more times); backslash is a special character too and should be escaped: '''\\''' matches with \
# ''character classes'' - you can specify a number of possible characters for one position in square brackets:
#* '''[ab,!]''' matches with a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' - you can specify ranges of letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches with - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# ''dot meta-character'' - '''.''' matches with any possible character (except newline, but the student can't enter one anyway); you should escape the dot ('''\.''') if you need to match a single dot.

====Operators====
The most commonly used regular expression operators (could anyone help expand descriptions and examples please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches with ab
#* '''a[0-9]''' matches with a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. it is the alternative that is repeated, rather than one alternative being selected and then repeated

====Precedence and order of evaluation====
'''Quantifiers''' have precedence over '''concatenation''', and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches with a followed by zero or more b's
#* changing this using brackets allows us to define a string repetition: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single-character alternatives use character classes, not alternative operators), but you should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not just the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the alternative, not a definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only

====Assertions====
''Assertions'' are statements about some part of the string that doesn't actually go into the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'' '''a+(?=b)''' matches with one or more a followed by b, without including b in the match
* ''negative lookahead assertion'' '''a+(?!b)''' matches with one or more a not followed by b
* ''positive lookbehind assertion'' '''(?<=b)a+''' matches with one or more a preceded by b
* ''negative lookbehind assertion'' '''(?<!b)a+''' matches with one or more a not preceded by b

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (modifiers are added for you by the question - '''u''' is always added, and '''i''' in case-insensitive mode). You should also enter one correct response (one that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question goes through all regular expressions in order to find the first full match (full for the expression, but not necessarily the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; zero grade is given) - or the longest one, if the engine can't tell which one would be shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match may be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
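The effect of anchoring on grading can be sketched in a few lines. This is not the question type's code (which is PHP); it is an illustration in Python, and the names <code>find_match</code> and <code>grade</code>, as well as the sample answers and grades, are assumptions made for the example.

```python
import re


def find_match(regex, response, exact):
    """Search the response; in exact mode anchor the regex at both ends first."""
    if exact:
        # The (?: ) wrapper is this sketch's own precaution, so that ^ and $
        # apply to the whole expression even if it has top-level alternatives.
        regex = "^(?:" + regex + ")$"
    return re.search(regex, response)


def grade(response, answers, exact=True):
    """Return the grade of the first regex (in order) that matches."""
    for regex, fraction in answers:
        if find_match(regex, response, exact):
            return fraction
    return 0.0


answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]
print(grade("are blue, white and red", answers))                 # 1.0
print(grade("they are blue, white, red", answers))               # 0.0
print(grade("they are blue, white, red", answers, exact=False))  # 1.0
print(grade("are white, red", answers))                          # 0.6
```

With exact matching the extra word "they" prevents a match; without it the same expression is found inside the response.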

===Hinting===
Some matching engines can support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can detect that the response starts to match and then breaks at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that the partial match is
are blue,
Remember that the regular expression is unanchored, so the match does not have to begin at the start of the student's response. With partial matching alone, the student is shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student gets a '''hint''' button; pressing it shows a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the REGEXP question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.
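The choice of the hint character can be illustrated with a brute-force sketch. It is not the DFA engine's algorithm: instead of an automaton it uses a hand-expanded list of every string the example expression can match, which is only possible here because the expression has no unbounded repetition. The names <code>ACCEPTED</code>, <code>common_prefix_len</code> and <code>best_hint</code> are hypothetical.

```python
# Every string that "are blue, white(,| and) red" can match,
# expanded by hand (an assumption made for this sketch).
ACCEPTED = ["are blue, white, red", "are blue, white and red"]


def common_prefix_len(a, b):
    """Return the length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_hint(response, accepted=ACCEPTED):
    """Return (matched_part, hint_char, chars_left) for the partial match
    that needs the fewest extra characters to become a full match."""
    best = None
    # The expression is unanchored, so try every start position.
    for start in range(len(response) + 1):
        tail = response[start:]
        for target in accepted:
            k = common_prefix_len(tail, target)
            left = len(target) - k
            if left == 0:
                return target, "", 0  # full match, no hint needed
            if best is None or left < best[2]:
                best = (target[:k], target[k], left)
    return best


print(best_hint("they are blue, vhite and red"))  # ('are blue, ', 'w', 10)
print(best_hint("are blue, white; red"))          # ('are blue, white', ',', 5)
```

In the second call the candidate ' ' would need 8 more characters, so ',' with 5 characters left wins, as described above.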

Not every regular expression has to give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with grades of 50%-100% are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
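As a minimal sketch of this rule (in Python, with a hypothetical function name and the two example expressions from this section), the selection of expressions for hinting is a simple filter on the grade:

```python
def regexes_for_hinting(answers, hint_grade_border):
    """Keep only the regexes whose grade reaches the hint grade border."""
    return [regex for regex, fraction in answers
            if fraction >= hint_grade_border]


answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]
# Border 1: only the 100% expression takes part in hinting.
print(regexes_for_hinting(answers, 1.0))
# Border 0.5: the 60% expression is used for hinting as well.
print(regexes_for_hinting(answers, 0.5))
```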

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and during matching an engine that supports subpatterns remembers ('''captures''') not only the whole match but also the parts corresponding to each subpattern. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last of its repetitions is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening brackets: '''0''' is the whole match, '''1''' is the first subpattern, etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the value of the first subpattern, etc. That can improve the quality of your feedback. Placeholders don't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine cannot do it, so only the {$0} placeholder works with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote this feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then, on entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} is replaced by an empty string. There is no way (for now) to remove "Integral part is" in that case, since the placeholder syntax would become complex and prone to errors.
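The placeholder substitution described above can be sketched as follows. This is an illustration in Python, not the question type's PHP code; the function name <code>fill_feedback</code> and the choice to leave unknown placeholders untouched are assumptions of the sketch.

```python
import re

# Matches placeholders of the form {$0}, {$1}, {$2}, ...
PLACEHOLDER = re.compile(r"\{\$(\d+)\}")


def fill_feedback(pattern, feedback, response):
    """Replace {$n} in the feedback with the n-th captured subpattern."""
    match = re.search(pattern, response)
    if match is None:
        return None  # no match, so this answer's feedback is not shown

    def substitute(m):
        index = int(m.group(1))
        if index > match.re.groups:
            return m.group(0)  # no such subpattern: leave the text as it is
        # A subpattern that took no part in the match gives an empty string.
        return match.group(index) or ""

    return PLACEHOLDER.sub(substitute, feedback)


number = r"[+\-]?([0-9]+)?\.([0-9]+)"
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(number, feedback, "123.34"))
# The number is: 123.34 Integral part is 123 and fractional part is 34
print(fill_feedback(number, feedback, ".5"))
# The number is: .5 Integral part is  and fractional part is 5
```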

===Error reporting===
The native PHP preg extension functions only report whether there is an error in a regular expression or not, so the '''PHP preg extension''' engine cannot tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. Several classes of potential errors are reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (one that matches {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is outside the current scope of development. There are more pressing issues ahead.

===Looking for missing things===
Joseph Rezeau's REGEXP question type has a special syntax for the '''missing words''' feature, which lets you define an answer that matches when something is absent from the response (and give appropriate feedback to the student). So if you want to look out for the missing word '''necessary''' in the response, you add this answer (WARNING - REGEXP-only syntax on the next line):
--.*\bnecessary\b.*
where \b defines a word boundary, while .* allows the word to be anywhere in the response.

There is no need for such a feature in the PREG question type, since a similar effect can be achieved with '''negative assertions''' combined with anchoring the start of the match. The equivalent regular expression to look for the missing word '''necessary''' would be
^(?!.*\bnecessary\b.*)
where
* '''(?!.*\bnecessary\b.*)''' is a '''negative lookahead assertion''' that allows matching only if there is no word '''necessary''' ahead of the current point in the string;
* '''^''' is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even though the word is present).

If this description is difficult for you, just surround the regexp that should be missing with '''^(?!''' and ''')'''.
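The need for the '''^''' anchor is easy to check by experiment. The sketch below uses Python's standard <code>re</code> module (not PCRE, but the syntax used here is shared); the constant and function names are made up for the example.

```python
import re

MISSING = r"^(?!.*\bnecessary\b.*)"      # anchored, as recommended above
UNANCHORED = r"(?!.*\bnecessary\b.*)"    # the same assertion without ^


def word_is_missing(response):
    """True when the word 'necessary' does not occur in the response."""
    return re.search(MISSING, response) is not None


print(word_is_missing("this step is optional"))     # True
print(word_is_missing("this step is necessary"))    # False
# \b requires a word boundary, so "unnecessary" does not count as the word.
print(word_is_missing("this step is unnecessary"))  # True

# Without the anchor a match is found inside the response even though
# the word is present: the assertion succeeds just past the start of the word.
print(re.search(UNANCHORED, "it is necessary").start())  # 7
```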

===Matching engines===
Matching engines are different pieces of program code that perform matching (either by different methods or written by different people). There is no single 'best' matching engine: the choice depends on the features you want to use and on the regular expressions the engine has to handle. The engines differ in stability and in the features they offer.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE does), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all PHP operands and operators are supported, and for some (more exotic) ones the behaviour may still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (more will be added):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (WARNING: they probably will not work for Unicode (non-Latin) characters for now)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (more will be added):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that cannot be supported by DFA matching at all:
* subpattern capturing
* backreferences

===Development plans===
There is no definite schedule or order of development for these features; it depends on the available time and developers. Many features require complex code. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Update the DFA matching engine to support all operations the DFA algorithm can handle
* Improve Unicode support of custom matching engines
* Add automatic generation of the shortest possible correct answer in user-readable form
* Add negative matching answers (AKA "missing words" from Joseph Rezeau's question) - depends on developing and checking in extra_answer_fields() code
* Add generation of a 'description' for a regular expression to facilitate its editing
* Develop NFA and backtracking matching engines
* Develop more help and examples for people who don't know much about regular expressions.

[[Category:Contributed code]]

Broken/Forum thread subscription

2010-10-15T22:20:58Z

Oasychev:

I don't think adding a new option for discussion subscription is a good idea. The options for forum and discussion subscription are not orthogonal. How should a forum behave with forum subscription disabled and automatic thread subscription? Or with forced forum subscription and optional thread subscription? It's better to add new subscription modes than to add a separate option.
--[[User:Oleg Sychev|Oleg Sychev]] 22:20, 15 October 2010 (UTC)

Broken/Javascript-interface for repeat elements function

2010-10-13T21:43:15Z

Oasychev:

1. Consider optionally sending an AJAX request to the server to inform it about added blanks, if that is found to be really necessary. --[[User:Oleg Sychev|Oleg Sychev]] 21:43, 13 October 2010 (UTC)

Javascript-interface for repeat elements function

2010-10-13T21:41:47Z

Oasychev: /* Moving blanks */ - clearifying example

'''GOAL''': improve usability by using dynamic form editing.

'''EXAMPLE''': imagine you have a multichoice question with 8 or more choices where the order of choices matters, and you find you want to add a new choice at the start or delete the first choice. With the current interface you are in BIG trouble.

At the moment form editing, especially adding new blank answers, units, choices and fields, requires a page reload.
Changing the order of blanks or removing a blank is not possible without a page reload.

'''TASK''': the main task is to develop a JavaScript interface for the repeat_elements function. The other task is to perform ADD, REMOVE and MOVE actions without reloading the page. If JavaScript is not available in the user's browser, these actions must work with a page reload, as they do now.

== Where it will be. ==
Some of the pages where this function is used are listed below:
* Course editing -> Adding a new Choice - Adding fields to form;
* Course editing -> Adding a new Quiz - Adding new feedback fields;
* Course editing -> Editing Quiz-> Editing a calculated question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a calculated question - Adding Blanks for Units;
* Course editing -> Editing Quiz-> Editing a Calculated Multichoice question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Simple Calculated question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Simple Calculated question - Adding Blanks for Units;
* Course editing -> Editing Quiz-> Editing a Matching question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Multiple choice question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Numerical question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Numerical question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Adding a short answer question - Adding Blanks for choices;
etc.
As repeat_elements is a core function, it should work anywhere.

This core patch will add new buttons to the form interface.
http://imglink.ru/pictures/21-09-10/f85258328f02d65164e6a15644e5d8a9.jpg
[[Image:7.jpg]]

== Adding new blanks ==

New blanks are added by pressing the "Add blanks for SMTH" button.
There will be a hidden empty blank. When the user presses the button, its duplicate is added to the end of the list of blanks.

== Removing blanks. ==

Only 'empty' blanks can be removed. A blank is empty when all its key fields are empty or have default values.
The following elements will be checked:
* FILE: file upload input box with a browse button. Empty when the file path is empty.
* FILEPICKER: general replacement for the file element in Moodle 2.0. Empty when the file path is empty.
* HTMLEDITOR: empty when it doesn't contain any content.
* PASSWORD and PASSWORDUNMASK: a password element. Empty when it doesn't contain any text.
* TEXT: simple text input element. Empty when it doesn't contain any text.
* TEXTAREA: a textarea element. Empty when it doesn't contain any text.

Other form elements such as DATEPICKER, RECAPTCHA, SELECT, HIDDEN and others may contain any values, because they won't be checked.

A blank is removed by pressing the remove button (see screenshot) and confirming your choice.
If the blank cannot be removed, the user gets an alert stating the reason.

== Moving blanks ==

Blanks can be moved by dragging them with the move button.
The original order of the blanks is reported on saving through hidden elements named "ord", which store the original order number of each blank. You can use them to determine where each element has gone (even if it was edited). This is useful, for example, when editing a question with existing attempts: the saving procedure knows the new indexes of all previous answers even if they were moved AND their content was edited, so it can adjust students' responses accordingly.

For example:
* '''Creation''': we have 3 empty blanks. The values of their "ords" are -1,-1,-1.
* '''Editing''': we edit all blanks. We have 3 blanks with ords -1,-1,-1.
* '''Saving''': we save the question (or something else). Ords: -1,-1,-1.
* '''Loading''': ords: 0,1,2.
* '''Removing''': we remove a blank from the middle of the list. We have 2 blanks with ords 0,2.
* '''Adding''': we add two blanks. We have ords 0,2,-1,-1.
* '''Moving''': we move the last blank to be first. Ords: -1,0,2,-1.

So the saving procedure knows that answer number 1 was deleted, that answer 0 is now at position 1 and answer 2 at position 2, and that positions 0 and 3 hold new blanks.
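How a saving procedure might read the submitted "ord" values can be sketched as below. The real code would be PHP inside Moodle; this Python version only illustrates the bookkeeping, and the function name <code>interpret_ords</code> and its return format are assumptions.

```python
def interpret_ords(ords, old_count):
    """Work out what happened to each previously saved blank.

    ords      -- "ord" values in the order the blanks were submitted
                 (-1 marks a blank that was never saved before)
    old_count -- number of blanks that existed when the form was loaded
    """
    new_index = {}  # old position -> new position
    added = []      # new positions that hold freshly added blanks
    for position, ord_value in enumerate(ords):
        if ord_value == -1:
            added.append(position)
        else:
            new_index[ord_value] = position
    # Saved blanks whose ord was not submitted have been removed.
    deleted = [old for old in range(old_count) if old not in new_index]
    return new_index, added, deleted


# The final state of the example above: ords -1,0,2,-1 with 3 saved blanks.
print(interpret_ords([-1, 0, 2, -1], 3))
# ({0: 1, 2: 2}, [0, 3], [1])
```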

== Function Interface ==

The repeat_elements function will receive 7 parameters instead of the 8 in the original version.
The new parameter $prefix holds the prefix for the names of the additional form elements added by the repeat_elements function.
In the original version these names were passed separately in two parameters: $repeathiddenname and $addfieldsname.
This preliminary change simplifies the interface and helps to solve the basic problem without changing it.

Javascript-interface for repeat elements function

2010-10-13T21:35:40Z

Oasychev: Example where it useful set

'''GOAL''': improve usability by using dynamic form editing.

'''EXAMPLE''': imagine you have a multichoice question with 8 or more choices where the order of choices matters, and you find you want to add a new choice at the start or delete the first choice. With the current interface you are in BIG trouble.

At the moment form editing, especially adding new blank answers, units, choices and fields, requires a page reload.
Changing the order of blanks or removing a blank is not possible without a page reload.

'''TASK''': the main task is to develop a JavaScript interface for the repeat_elements function. The other task is to perform ADD, REMOVE and MOVE actions without reloading the page. If JavaScript is not available in the user's browser, these actions must work with a page reload, as they do now.

== Where it will be. ==
Some of the pages where this function is used are listed below:
* Course editing -> Adding a new Choice - Adding fields to form;
* Course editing -> Adding a new Quiz - Adding new feedback fields;
* Course editing -> Editing Quiz-> Editing a calculated question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a calculated question - Adding Blanks for Units;
* Course editing -> Editing Quiz-> Editing a Calculated Multichoice question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Simple Calculated question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Simple Calculated question - Adding Blanks for Units;
* Course editing -> Editing Quiz-> Editing a Matching question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Multiple choice question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Numerical question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Editing a Numerical question - Adding Blanks for choices;
* Course editing -> Editing Quiz-> Adding a short answer question - Adding Blanks for choices;
etc.
As repeat_elements is a core function, it should work anywhere.

This core patch will add new buttons to the form interface.
http://imglink.ru/pictures/21-09-10/f85258328f02d65164e6a15644e5d8a9.jpg
[[Image:7.jpg]]

== Adding new blanks ==

New blanks are added by pressing the "Add blanks for SMTH" button.
There will be a hidden empty blank. When the user presses the button, its duplicate is added to the end of the list of blanks.

== Removing blanks. ==

Only 'empty' blanks can be removed. A blank is empty when all its key fields are empty or have default values.
The following elements will be checked:
* FILE: file upload input box with a browse button. Empty when the file path is empty.
* FILEPICKER: general replacement for the file element in Moodle 2.0. Empty when the file path is empty.
* HTMLEDITOR: empty when it doesn't contain any content.
* PASSWORD and PASSWORDUNMASK: a password element. Empty when it doesn't contain any text.
* TEXT: simple text input element. Empty when it doesn't contain any text.
* TEXTAREA: a textarea element. Empty when it doesn't contain any text.

Other form elements such as DATEPICKER, RECAPTCHA, SELECT, HIDDEN and others may contain any values, because they won't be checked.

A blank is removed by pressing the remove button (see screenshot) and confirming your choice.
If the blank cannot be removed, the user gets an alert stating the reason.

== Moving blanks ==

Blanks can be moved by dragging them with the move button.
The original order of the blanks is preserved through hidden elements named "ord", which store the order number of each blank.

For example:
* '''Creation''': we have 3 empty blanks. The values of their "ords" are -1,-1,-1.
* '''Editing''': we edit all blanks. We have 3 blanks with ords -1,-1,-1.
* '''Saving''': we save the question (or something else). Ords: -1,-1,-1.
* '''Loading''': ords: 0,1,2.
* '''Removing''': we remove a blank from the middle of the list. We have 2 blanks with ords 0,2.
* '''Adding''': we add two blanks. We have ords 0,2,-1,-1.
* '''Moving''': we move the last blank between the first and the second. Ords: 0,-1,2,-1.

So when we create or add new blanks, the initial values of their ords are -1. When we load blanks that were saved earlier, their ord values are 0,1,2,3,...,n.

== Function Interface ==

The repeat_elements function will receive 7 parameters instead of the 8 in the original version.
The new parameter $prefix holds the prefix for the names of the additional form elements added by the repeat_elements function.
In the original version these names were passed separately in two parameters: $repeathiddenname and $addfieldsname.
This preliminary change simplifies the interface and helps to solve the basic problem without changing it.

Broken/Javascript-interface for repeat elements function

2010-10-02T21:27:16Z

Oasychev:

# "Where it will be" - I don't think a full list is necessary, you may just list some places. A core function should work anywhere it is called; how it is called isn't your problem.
# I don't think using AJAX to add new blanks is a good idea. Why not just add them using JavaScript on the page, creating new HTML elements (you'll need a hidden empty blank somewhere, ready for copying) and changing the hidden element holding the number of blanks? The server is stateless, it won't find anything strange when the modified form is submitted.
# The last sentences about deleting/moving aren't necessary; you are talking to smart people, they surely know what those icons do. At least make the pictures smaller - or just delete all this.
# Still nothing about preserving the original numbers of the blanks and the proposed change in the repeat_elements interface to set $prefix instead of two existing (and one new) element names...
# I don't see any "restore order" button on the screenshots, and I'm also not sure why it is necessary, while it definitely wouldn't be an easy thing to implement. If you implement all the other things, that will be quite enough.
--[[User:Oleg Sychev|Oleg Sychev]] 21:27, 2 October 2010 (UTC)

Broken/Forum thread subscription

2010-10-01T13:30:01Z

Oasychev: New page: # when showing interface screens please don't show full screen - cut a part you are changing and area about them for context (when changing block insert just this block) - look at [[Develo...

# when showing interface screens please don't show the full screen - cut out the part you are changing and the area around it for context (when changing a block, insert just this block) - look at [[Development:Categories_editing_interface_improvment]] for examples
# write about what changes will be made in the forum options form (new options for subscription modes, and for the user auto-subscription mode too)
# write about how your patch will work when the subscription mode for the forum changes
--[[User:Oleg Sychev|Oleg Sychev]] 13:30, 1 October 2010 (UTC)

Broken/Javascript-interface for repeat elements function

2010-10-01T13:21:35Z

Oasychev: New page: I already told you what I think about this page: # one screenshot is quite enought for what you done, it isn't necessary to add screenshot to every alert message box - this is not Microsof...

I already told you what I think about this page:
# one screenshot is quite enough for what you have done; it isn't necessary to add a screenshot of every alert message box - this is not a Microsoft User Manual, just a sentence saying that this will be done is enough;
# one more screenshot is needed to show how it would look when a blank contains exactly one string, without any fieldsets around it
# write out thoroughly which field types you would check for being non-empty when deleting
# write how you would remember the initial positions of the blanks and return them after editing
# and mind your spelling, correct "choise" all over the place.
Now I am wondering why it is still not done...
--[[User:Oleg Sychev|Oleg Sychev]] 13:21, 1 October 2010 (UTC)

Preg question type

2010-09-29T20:46:37Z

Oasychev: /* Precedence and order of evaluation */ - correcting a typo in the answer

The Preg question type uses regular expression pattern matching to check whether a student's response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more to be done than we have time for.

For now the new code with all this functionality is located in the HEAD branch of the preg question type (you can also download it from the [http://moodle.org/mod/data/view.php?d=13&rid=1901&filter=1 Modules and Plugins database] using the 'latest version' link). It works with Moodle 2.0. The code is considered BETA quality - use it with care! If you find any bugs, please report them on the tracker.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you have all mastered arithmetic expressions since childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Miss that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by operator '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' evaluates '+' first and multiplies its result by 2. Still easy?

One more thing we should learn about operators is their '''arity''', which is just the number of operands required. In the example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of the order of evaluation through their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use different operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of a regular expression is pattern matching in strings. So its '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' (preceded by a backslash) when used as operands. Math expressions never had escaping problems since their operands (numbers, variables) are built from different characters than operators (+, - etc.), but when writing a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches only one character isn't very useful. This is where '''operators''' come in, allowing us to define an expression that matches a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match themselves
# ''escaped special characters'': if you need to use a character with a special meaning (like |, * or a bracket) as an ordinary character to match, precede it with a backslash: '''a\*''' matches a* (while '''a*''' matches a zero or more times); backslash is a special character too and must be escaped: '''\\''' matches \
# ''character classes'': you can specify a number of possible characters for one place in square brackets:
#* '''[ab,!]''' matches a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' - you can specify ranges for letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# ''dot meta-character'': '''.''' matches any possible character (except newline, but the student cannot enter one anyway); escape the dot, '''\.''', if you need to match a single dot.
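The operands listed above can be checked with a few lines of Python. This is only a sketch: it uses Python's standard <code>re</code> module rather than PCRE (the syntax shown here is common to both), and the helper name <code>matches</code> is invented for the example.

```python
import re


def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result)
checks = [
    (r"a\*", "a*", True),            # escaped * is an ordinary character
    (r"a*", "aaa", True),            # unescaped * is a quantifier
    (r"\\", "\\", True),             # escaped backslash matches one backslash
    (r"[ab,!]", ",", True),          # character class
    (r"[a-szC-F0-9]", "D", True),    # ranges mixed with single characters
    (r"[a-szC-F0-9]", "A", False),
    (r"[^ab]", "c", True),           # negative character class
    (r"[^ab]", "a", False),
    (r"[\-\]\\]", "]", True),        # escaping inside a character class
    (r".", "x", True),               # dot matches any character
    (r"\.", "x", False),             # escaped dot matches only a dot
]
for pattern, text, expected in checks:
    assert matches(pattern, text) is expected, (pattern, text)
print("all operand checks passed")
```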

====Operators====
The most commonly used regular expression operators (could anyone help expand the descriptions and examples please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its own precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches ab
#* '''a[0-9]''' matches a followed by any digit
# ''alternative'' - a '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to apply a quantifier to more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. the alternative itself is repeated, rather than one alternative being selected and then repeated
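A quick way to get a feel for these operators is to test them against sample strings. The sketch below does this with Python's standard <code>re</code> module, which accepts all the forms listed above (including '''x{,2}'''); it is not the question type's own matcher, and the helper name <code>matches</code> is invented for the example.

```python
import re


def matches(pattern, text):
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# (pattern, text, expected result)
checks = [
    (r"ab|cd|", "", True),           # empty alternative matches emptiness
    (r"(aa|bb|)c", "c", True),
    (r"(aa|bb|)c", "aac", True),
    (r"(aa|bb)c", "c", False),
    (r"x?", "", True),               # zero or one
    (r"x+", "", False),              # at least one is required
    (r"x{2,4}", "x", False),
    (r"x{2,4}", "xxxx", True),
    (r"x{2,4}", "xxxxx", False),
    (r"x{2,}", "xxxxxxx", True),
    (r"x{,2}", "", True),
    (r"x{,2}", "xxx", False),
    (r"x{2}", "xx", True),
    (r"(ab)*", "ababab", True),      # brackets make the quantifier repeat ab
]
for pattern, text, expected in checks:
    assert matches(pattern, text) is expected, (pattern, text)
print("all operator checks passed")
```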

====Precedence and order of evaluation====
'''Quantifier''' has precedence over '''concatenation''' and '''concatenation''' has precedence over '''alternative'''. Let's look what it means:
# ''quantifier over concatenation'' means quantifiers are executed first and without brackets would repeat only single character:
#* '''ab*''' matches with a followed with zero or more b's
#* changing this using brackets allows us define a string repetition: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you could define multi-character alternatives without brackets (for single character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the whole alternative, not one definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only
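
The precedence rules above can be checked mechanically. The sketch below uses Python's standard <code>re</code> module as a stand-in for the Perl-compatible dialect (an assumption: it is not the question type's own engine, and the helper name <code>full</code> is made up for this illustration). Each entry pairs an expression with a string and says whether the whole string should match.

```python
import re

def full(pattern, text):
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
checks = [
    (r"ab*", "abbb", True),            # quantifier applies to b only
    (r"ab*", "abab", False),           # ...so the string ab is not repeated
    (r"(ab)*", "abab", True),          # brackets make the quantifier repeat ab
    (r"ab|cd|de", "cd", True),         # concatenation groups before alternative
    (r"ab|cd|de", "abd", False),       # not "a, then b or c, then d"
    (r"(aa|bb|)c", "c", True),         # empty alternative inside brackets
    (r"ab|cd*", "cddd", True),         # quantifier applies to d only
    (r"(ab|cd)*", "ababcdabcdcd", True),
    (r"(a|b){2}", "ab", True),         # the alternative is chosen anew on each repeat
    (r"a{2}|b{2}", "ab", False),       # here each branch repeats one character
]

for pattern, text, expected in checks:
    assert full(pattern, text) == expected, (pattern, text)
```

Running the block silently means every row behaved as the list above describes.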

====Assertions====
''Assertions'' are conditions on some part of the string that does not become part of the matched text, but affects whether a match occurs.
* ''positive lookahead assert'' '''a+(?=b)''' matches with one or more a's that are followed by b, without including b in the match
* ''negative lookahead assert'' '''a+(?!b)''' matches with one or more a's that are not followed by b
* ''positive lookbehind assert'' '''(?<=b)a+''' matches with one or more a's preceded by b
* ''negative lookbehind assert'' '''(?<!b)a+''' matches with one or more a's that are not preceded by b

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that suits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (the question adds modifiers for you: '''u''' always and '''i''' in case-insensitive mode). You should also enter one correct response (that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question tries all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine couldn't tell which one would be shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
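
A minimal sketch of the difference between anchored and non-anchored matching, written with Python's <code>re</code> module rather than the question's own code. The helper <code>anchor</code> is hypothetical; wrapping the expression in <code>(?: )</code> before adding ^ and $ is an assumption made here so that the anchors apply to the whole expression and not only to its first and last alternatives.

```python
import re

def anchor(pattern):
    """Hypothetical helper: require the pattern to cover the whole response."""
    return "^(?:" + pattern + ")$"

pattern = r"are blue, white(,| and) red"
response = "they are blue, white and red"

# Without anchors the match may sit anywhere inside the response.
unanchored = re.search(pattern, response)

# With anchors the extra word "they " prevents a match.
anchored = re.search(anchor(pattern), response)

# Without the (?: ) wrapper, ^ab|cd$ means "ab at the start OR cd at the end".
loose = re.search(r"^ab|cd$", "xxcd")
strict = re.search(anchor(r"ab|cd"), "xxcd")
```

Here <code>unanchored</code> and <code>loose</code> are match objects, while <code>anchored</code> and <code>strict</code> are <code>None</code>.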

===Hinting===
Some matching engines can support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find where the response starts to match and at which character the match breaks. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student has a '''hint''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.

It is possible that not all regular expressions give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with 50%-100% grades are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and during matching the engine (if it supports subpatterns) remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), you should use (?: ) instead of just ( ). Asserts don't create subpatterns.

Subpatterns are counted from left to right by opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers could have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote this feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax could become complex and prone to errors.
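
The placeholder substitution described in this section can be sketched in a few lines. This is an illustration in Python, not the question type's PHP code; the function name <code>fill_feedback</code> and the choice to leave unknown placeholder numbers untouched are assumptions.

```python
import re

def fill_feedback(feedback, match):
    """Replace each {$N} placeholder with the text captured by subpattern N."""
    def replace(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:
            # Assumption: a placeholder without a subpattern is left as typed.
            return placeholder.group(0)
        # A subpattern that took no part in the match becomes an empty string.
        return match.group(index) or ""
    return re.sub(r"\{\$(\d+)\}", replace, feedback)

number = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
feedback = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"

with_integral = fill_feedback(feedback, number.fullmatch("123.34"))
without_integral = fill_feedback(feedback, number.fullmatch(".5"))
```

For the response .5 the text "Integral part is" stays in place with nothing after it, which is the limitation mentioned above.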

===Error reporting===
Native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. There are several classes of potential errors reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches with {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

===Matching engines===
Matching engines are different pieces of program code that do the matching (either by different methods or written by different people). There is no single 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. They have different degrees of stability and offer different features.
====PHP preg extension====
It is based on native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and is very stable and thoroughly tested. Sadly, PHP functions don't support partial matching (while PCRE could), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP preg are supported, and for some (more exotic) ones the support could still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (WARNING: they probably will not work for Unicode (non-latin) characters for now)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (there would be more):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that couldn't be supported by DFA matching at all:
* subpattern capturing
* backreferences

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Update DFA matching engine to support all operations DFA algorithm could
* Improve Unicode support of custom matching engines
* Add automatic generation of shortest possible correct answer in user-readable form
* Add negative matching answers (AKA "missing words" from Joseph Rezeau question) - depends on developing and checking in extra_answer_fields() code
* Add generation of a 'description' for a regular expression to facilitate its editing
* Develop NFA and backtracking matching engines
* Develop more help and examples for the people that don't know much about regular expressions.

[[Category:Contributed code]]

VSTU projects

2010-09-21T16:20:36Z

Oasychev: /* Blocks */

==Lead: Sychev Oleg Aleksandrovich==
Senior Lecturer of POAS (Software Engineering) department

===Core patches===
# [[Development:Categories editing interface improvment|Improvement of the editing interface for the category list]]. Moving a category anywhere should require only one page reload. Developers: Sychev Oleg, Shkarupa Alex. Status: waiting for review!
# [[Development:Javascript-interface for repeat_elements function|Javascript-interface for repeat_elements function]] - adding new blanks to the form shouldn't require page reloads.
# [[Development:Forum thread subscription|Forum thread subscription]] - a long-standing request to be able to subscribe only to a thread, not the whole forum.

===Activity modules===
# [[Development:Assignment development|New assignment-replacement module]] with support for individual tasks (with custom fields), several grading criteria, automatic graders etc. Developers: Sychev Oleg, Erofeev Anatolius.
# [[Development:Subcourse module improvments|Subcourse module improvements]] - improve the Subcourse module to get rid of buggy Metacourses and make it actually useful.

===Question types===
# [[Preg question type]] - regular expression question type, developed on more solid algorithmic basis than regex one. Developers: Sychev Oleg, Kolesov Dmitry.
# [[Development:Auto-feedback shortanswer question|Auto-feedback shortanswer question]] - a shortanswer question able to detect typos and misplaced, missing or extra words and give appropriate feedback.

===Blocks===
# [[Development:Linked activities|Linked activities]] to allow editing properties of multiple activities at once.
# [[Auto role assignment block]] is useful to temporarily restrict (or enhance) user abilities while doing something.
# [[Under supervision block]] is much better than IP filter+time control if you want students to be able to do something only under teacher supervision.

===Libraries===
# [[Development:Auto-backup library|Auto-backup library]] - by adding a small amount of information to install.xml files, it is possible to automate many tedious tasks of programming backup/restore for plugins.

Preg question type

2010-09-21T16:06:07Z

Oasychev:

Preg question type is a question type that uses regular expression pattern matching to check whether a student's response is correct. It uses the Perl-compatible regular expression dialect. For a detailed description of regular expression syntax see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code, regex parsing and error reporting - Oleg Sychev;
# regex parsing, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see [[#Development plans|development plans]] section) - there is still more to be done than we have time.

For now, the new code with all this functionality is located in the HEAD branch of the preg question type (you can also download it from the [http://moodle.org/mod/data/view.php?d=13&rid=1901&filter=1 Modules and Plugins database] using the 'latest version' link). It works with Moodle 2.0. The code is considered BETA quality - use it with care! If you find any bugs, please report them on the tracker.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''' there, governed by the operators' '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators: their '''arity''', which is just the number of operands required. In example above '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus has '''unary''' (single operand) form, like in this equation: '''y=-x'''. Note that unary and binary minuses work differently.

Now any expression is just a Lego game, where you set out a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), taking heed of their order of evaluation using their '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of a regular expression is pattern matching in strings. So their '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches with the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches only one character isn't very useful. So here come the '''operators''' that allow us to define expressions matching strings of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match with themselves
# ''escaped special characters'' if you need to use a character with special meaning (like |, * or a bracket) as just a usual character to match, you should precede it with a backslash: '''a\*''' matches with a* (while '''a*''' matches with a zero or more times); the backslash is a special character too and should be escaped: '''\\''' matches with \
# ''character classes'' you could specify a number of possible characters in one place in square brackets:
#* '''[ab,!]''' matches with a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' you could specify ranges for letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' starts with ^ '''[^ab]''' means any characters except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches with - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not be first
# ''dot meta-character'' '''.''' matches with any possible character (except newline, but the student couldn't enter it anyway); you should escape the dot '''\.''' if you need to match a single dot.
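
A quick way to get a feel for these operands is to try them on short strings. The sketch below does this with Python's <code>re</code> module, used here as an assumed stand-in for the Perl-compatible dialect; the helper name <code>matches_whole</code> is invented for the example.

```python
import re

def matches_whole(pattern, text):
    """Return True when the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
operand_examples = [
    (r"a\*", "a*", True),            # escaped star is an ordinary character
    (r"\\", "\\", True),             # escaped backslash matches one backslash
    (r"[ab,!]", "!", True),          # any one character from the class
    (r"[a-szC-F0-9]", "z", True),    # z is listed on its own after the a-s range
    (r"[a-szC-F0-9]", "t", False),   # t falls outside a-s
    (r"[^ab]", "c", True),           # negative class
    (r"[^ab]", "a", False),
    (r"[\-\]\\]", "]", True),        # escaped -, ] and \ inside a class
    (r"[a^]", "^", True),            # ^ is literal when it is not first
    (r".", "x", True),               # dot matches any character...
    (r".", "\n", False),             # ...except a newline
    (r"\.", "x", False),             # escaped dot matches only a dot
]

for pattern, text, expected in operand_examples:
    assert matches_whole(pattern, text) == expected, (pattern, text)
```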

====Operators====
The most commonly used regular expression operators (could anyone help expand the descriptions and examples, please?):
# ''concatenation'' - a '''binary''' operator so simple that it has no character of its own. It is still an operator and has its own precedence, which matters when you decide where to use brackets. Concatenation lets you write several operands in sequence:
#* '''ab''' matches with ab
#* '''a[0-9]''' matches with a followed by any digit
# ''alternative'' - '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or nothing (useful as part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of the empty alternative
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one time
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. to apply a quantifier to more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. the whole alternative is repeated, rather than one alternative being selected and then repeated
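
One way to see what an operator does is to list every short string that an expression accepts. The sketch below does this by brute force in Python; <code>matching_strings</code> is a made-up helper, and Python's <code>re</code> module is only assumed to agree with the question's dialect for these simple expressions. The spelling '''x{0,2}''' is used instead of '''x{,2}''' because not every engine reads the shorter form as a quantifier.

```python
import re
from itertools import product

def matching_strings(pattern, alphabet, max_len):
    """List every string over `alphabet`, up to `max_len` characters long,
    that the pattern matches in full (shortest strings first)."""
    compiled = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if compiled.fullmatch(candidate):
                found.append(candidate)
    return found

repeated_alternative = matching_strings(r"(a|b){2}", "ab", 3)
repeated_branches = matching_strings(r"a{2}|b{2}", "ab", 3)
bounded = matching_strings(r"x{2,4}", "x", 5)
up_to_two = matching_strings(r"x{0,2}", "x", 5)
with_empty = matching_strings(r"(aa|bb|)c", "abc", 3)
```

Comparing <code>repeated_alternative</code> (aa, ab, ba, bb) with <code>repeated_branches</code> (aa, bb) shows the difference between repeating an alternative and repeating each of its branches.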

====Precedence and order of evaluation====
'''Quantifiers''' have precedence over '''concatenation''', and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches with a followed by zero or more b's
#* changing this with brackets allows us to define a string repetition: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you could define multi-character alternatives without brackets (for single character alternatives use character classes, not alternative operators) but should use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc.
#* note that a quantifier repeats the whole alternative, not one definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only

====Assertions====
''Assertions'' are conditions on some part of the string that does not become part of the matched text, but affects whether a match occurs.
* ''positive lookahead assert'' '''a+(?=b)''' matches with one or more a's that are followed by b, without including b in the match
* ''negative lookahead assert'' '''a+(?!b)''' matches with one or more a's that are not followed by b
* ''positive lookbehind assert'' '''(?<=b)a+''' matches with one or more a's preceded by b
* ''negative lookbehind assert'' '''(?<!b)a+''' matches with one or more a's that are not preceded by b
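
The four assertions can be tried on short strings to see which part ends up in the match. This is a sketch using Python's <code>re</code> module (assumed to behave like the Perl-compatible dialect for these simple cases); <code>first_match</code> is a hypothetical helper that returns the matched text together with its starting position.

```python
import re

def first_match(pattern, text):
    """Return (matched text, start index) of the first match, or None."""
    found = re.search(pattern, text)
    return (found.group(0), found.start()) if found else None

# b is checked but does not become part of the match.
lookahead = first_match(r"a+(?=b)", "caab")

# The match gives back one a, so that the next character is not b.
neg_lookahead = first_match(r"a+(?!b)", "aab")

# Only the a's that come right after a b qualify.
lookbehind = first_match(r"(?<=b)a+", "abaa")

# The a directly after b cannot start the match, so it starts one later.
neg_lookbehind = first_match(r"(?<!b)a+", "baa")
```

The negative forms are the ones most likely to surprise: the engine shortens or shifts the match until the assertion holds, rather than rejecting the string.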

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that suits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (the question adds modifiers for you: '''u''' always and '''i''' in case-insensitive mode). You should also enter one correct response (that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question tries all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and takes the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete is chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine couldn't tell which one would be shortest to complete.

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''', the match must start at the start of the student's response;
* if a regular expression ends with '''$''', the match must end at the end of the student's response;
* otherwise the match can be located anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default setting), the question adds ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.

===Hinting===
Some matching engines can support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find where the response starts to match and at which character the match breaks. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that partial match is
are blue,
Remember, the regular expression is unanchored, so the match need not start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
they are blue, vhite and red
When hinting is available, the student has a '''hint''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the engine supports it), hinting chooses the character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.

It is possible that not all regular expressions give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0.5, regular expressions with 50%-100% grades are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
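
The effect of the hint grade border can be pictured as a simple filter over the answers. The following is only a sketch in Python with invented names (<code>hinting_patterns</code>, grades stored as fractions of 1); the question type's real data structures are not described in this page.

```python
def hinting_patterns(answers, hint_grade_border):
    """Keep the expressions whose grade reaches the hint grade border."""
    return [pattern for pattern, grade in answers if grade >= hint_grade_border]

# (regular expression, grade as a fraction of the full mark)
answers = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

strict = hinting_patterns(answers, 1.0)    # only the 100% expression gives hints
lenient = hinting_patterns(answers, 0.5)   # the 60% expression gives hints too
```

With the border at 1, a student who typed "are white, oh I forgot other colors." would be hinted towards the full answer with ''blue'', not towards the 60% one.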

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and during matching the engine (if it supports subpatterns) remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), you should use (?: ) instead of just ( ). Asserts don't create subpatterns.

Subpatterns are counted from left to right by opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers could have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
[+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote this feedback:
The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then entering
123.34
the student will see
The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax could become complex and prone to errors.

===Error reporting===
Native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. There are several classes of potential errors reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and preg functions) treat most of them as '''non-errors''', making the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches with {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers, which is out of the current scope of development. There are more pressing issues ahead.

===Matching engines===
Matching engines are different pieces of program code that do the matching (either by different methods or written by different people). There is no single 'best' matching engine - it depends on the features you want to use and the regular expressions the engine should handle. They have different degrees of stability and offer different features.
====PHP preg extension====
It is based on native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features, and is very stable and thoroughly tested. Sadly, PHP functions don't support partial matching (while PCRE could), so (unless we persuade the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of PHP preg are supported, and for some (more exotic) ones the support could still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (there would be more):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (WARNING: they probably will not work for Unicode (non-latin) characters for now)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (more will be added):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that cannot be supported by DFA matching at all:
* subpattern capturing
* backreferences

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Update the DFA matching engine to support all operations that the DFA algorithm can
* Improve Unicode support of custom matching engines
* Add automatic generation of shortest possible correct answer in user-readable form
* Add negative matching answers (AKA "missing words" from Joseph Rezeau's question) - depends on developing and checking in the extra_answer_fields() code
* Add generation of a 'description' for a regular expression to facilitate its editing
* Develop NFA and backtracking matching engines
* Develop more help and examples for people who don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2010-09-21T15:57:22Z

Oasychev: /* Subpattern capturing and feedback */

Preg question type is a question type that uses regular expression pattern matching to determine whether a student's response is correct. It uses the Perl-compatible dialect of regular expressions. For a detailed description of the regular expression syntax, see http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm

Authors:
# idea, design, question type code - Oleg Sychev;
# parsing regular expression, DFA regular expression matching engine - Dmitriy Kolesov.
We would gladly accept testers and contributors (see the [[#Development plans|development plans]] section) - there is still more to be done than we have time for.

===Understanding expressions===
Regular expressions - like any '''expressions''' - are just a bunch of '''operators''' with their '''operands'''. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy - if you look at them from the right angle. Learn (or recall) only 4 new words - and you are a master of regexes with very wide possibilities. Fail to find that angle, and regular expressions could forever remain a vast menace where only a few steps are sure. Let's go?

Look at a simple math expression: '''x+y*2'''. There are two '''operators''' there: '+' and '*'. The '''operands''' of '*' are 'y' and '2'. The '''operands''' of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite '''order of evaluation''', governed by the operators' '''precedence'''. '*' has precedence over '+', so it is evaluated first. You can change the order of evaluation using brackets: '''(x+y)*2''' will evaluate '+' first and multiply its result by 2. Still easy?

One more thing we should learn about operators is their '''arity''', which is just the number of operands required. In the example above, '+' and '*' are '''binary''' operators - they both take two operands. Most arithmetic operators are binary, but minus also has a '''unary''' (single operand) form, as in this equation: '''y=-x'''. Note that unary and binary minus work differently.

Now any expression is just a Lego game, where you put together a sequence of '''operators''' with the correct number of '''operands''' for each ('''arity'''), minding their order of evaluation through '''precedence''' and brackets. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding pattern matches in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

===Regular expressions===
The goal of a regular expression is pattern matching in strings. So its '''operands''' are characters or character sets. '''A''' is a regular expression too, and it matches with the single character 'A'. There are several ways to define a character set, described below. Special characters, used to write operators, must be '''escaped''' when used as operands - preceded by a backslash. Math expressions never had escaping problems since their operands (numbers, variables) are built from different characters than operators (+, - etc.), but when setting a pattern for matching you should be able to use any character as an operand.

Still, a pattern that matches with only one character isn't very useful. So here come the '''operators''' that allow us to define an expression matching with a string of several characters.

====Operands====
You can use these operands in your expressions:
# ''simple characters'' match with themselves
# ''escaped special characters'' if you need to use a character with special meaning (like |, * or a bracket) as an ordinary character to match, you should precede it with a backslash: '''a\*''' matches with a* (while '''a*''' matches with a zero or more times); the backslash is a special character too and should be escaped: '''\\''' matches with \
# ''character classes'' you can specify a number of possible characters for one position in square brackets:
#* '''[ab,!]''' matches with a or b or , or !
#* ''ranges'': '''[a-szC-F0-9]''' you can specify ranges of letters and digits in character classes, mixing them with single characters
#* ''negative character classes'' start with ^: '''[^ab]''' means any character except a and b
#* ''escaping inside character classes'': '''[\-\]\\]''' matches with - or ] or \; other characters lose their special meaning inside a character class and shouldn't be escaped, but if you want to include ^ in the character class it should not come first
# ''dot meta-character'' '''.''' matches with any possible character (except newline, but a student couldn't enter one anyway); you should escape the dot '''\.''' if you need to match a single dot.
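
The operands listed above can be tried out in a short script. This is only a sketch: it uses Python's standard <code>re</code> module as a stand-in for the question's own engines (the basic operand syntax shown here is the same), and the helper name <code>matches</code> is made up for the illustration.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Escaped special character versus the same character used as an operator.
print(matches(r"a\*", "a*"))   # backslash makes * an ordinary character
print(matches(r"a*", "aaa"))   # unescaped * is a quantifier

# Character classes: a plain set, ranges, a negative class, escapes inside a class.
print([matches(r"[ab,!]", ch) for ch in "ab,!"])
print([matches(r"[a-szC-F0-9]", ch) for ch in "zD7tG"])
print(matches(r"[^ab]", "c"), matches(r"[^ab]", "a"))
print([matches(r"[\-\]\\]", ch) for ch in "-]\\"])

# The dot meta-character versus an escaped dot.
print(matches(r".", "x"), matches(r"\.", "x"), matches(r"\.", "."))
```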

====Operators====
The most commonly used regular expression operators (could anyone help expand the descriptions and examples, please?):
# ''concatenation'' - a '''binary''' operator so simple that it doesn't have any character at all. Still, it is an operator and has its precedence, which is important if you want to understand where to use brackets. Concatenation allows you to write several operands in sequence:
#* '''ab''' matches with ab
#* '''a[0-9]''' matches with a followed by any digit
# ''alternative'' - '''binary''' operator that lets you define a set of alternatives:
#* '''a|b''' means a or b
#* '''ab|cd|de''' means ab or cd or de
#* empty alternative: '''ab|cd|''' means ab or cd or emptiness (useful as a part of more complex expressions)
#* '''(aa|bb)c''' means aac or bbc - use brackets to outline the alternative set
#* '''(aa|bb|)c''' means aac or bbc or c - a typical use of emptiness
# ''quantifiers'' - '''unary''' operators that let you define repetition of a character (or regular expression) used as their operand:
#* '''x*''' means x zero or more times
#* '''x+''' means x one or more times
#* '''x?''' means x zero or one times
#* '''x{2,4}''' means x from 2 to 4 times
#* '''x{2,}''' means x two or more times
#* '''x{,2}''' means x from 0 to 2 times
#* '''x{2}''' means x exactly 2 times
#* '''(ab)*''' means ab zero or more times, i.e. if you want to use a quantifier on more than one character, you should use brackets
#* '''(a|b){2}''' means aa or ab or ba or bb, i.e. the alternative itself is repeated, rather than one alternative being selected and then repeated

====Precedence and order of evaluation====
'''Quantifier''' has precedence over '''concatenation''' and '''concatenation''' has precedence over '''alternative'''. Let's look at what this means:
# ''quantifier over concatenation'' means quantifiers are applied first and, without brackets, repeat only a single character:
#* '''ab*''' matches with a followed by zero or more b's
#* changing this using brackets allows us to define a string repetition: '''(ab)*''' matches with ab zero or more times
# ''concatenation over alternative'' means you can define multi-character alternatives without brackets (for single character alternatives use character classes, not alternative operators) but must use brackets when you need to add something to the alternative set:
#* '''ab|cd|de''' matches with ab or cd or de
#* '''(aa|bb)c''' matches with aac or bbc - use brackets to outline alternative set
#* '''(aa|bb|)c''' matches with aac or bbc or c - typical use of an empty alternative
# ''quantifier over alternative'' means you should use brackets to repeat an alternative set (not the last character in it):
#* '''ab|cd*''' matches with ab or c followed by zero or more d's
#* '''(ab|cd)*''' matches with ab or cd, repeated zero or more times in any order, like ababcdabcdcd etc
#* note that a quantifier repeats the alternative, not a definite selection from it, i.e.:
#*# '''(a|b){2}''' matches with aa or ab or ba or bb, not just aa or bb
#*# use '''a{2}|b{2}''' to match aa or bb only
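
The precedence rules above can be checked mechanically. The sketch below uses Python's standard <code>re</code> module as a stand-in (for these three operators it follows the same precedence); <code>whole</code> is a hypothetical helper that tests whether an entire string matches.

```python
import re

def whole(pattern, text):
    """Return True when the entire text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Quantifier over concatenation: * applies to b only, unless brackets are used.
print(whole("ab*", "abbb"), whole("ab*", "abab"), whole("(ab)*", "abab"))

# Quantifier over alternative: * applies to d only, unless brackets are used.
print(whole("ab|cd*", "cddd"), whole("ab|cd*", "cdcd"))
print(whole("(ab|cd)*", "ababcdabcdcd"))

# A quantifier repeats the alternative, not one fixed selection from it.
print(whole("(a|b){2}", "ab"), whole("a{2}|b{2}", "ab"), whole("a{2}|b{2}", "aa"))
```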

====Assertions====
''Assertions'' are conditions on some part of the string that doesn't actually go into the matched text, but affects whether a match occurs or not.
* ''positive lookahead assertion'' '''a+(?=b)''' matches with one or more a's that are followed by b, without including b in the match
* ''negative lookahead assertion'' '''a+(?!b)''' matches with one or more a's that are not followed by b
* ''positive lookbehind assertion'' '''(?<=b)a+''' matches with one or more a's preceded by b
* ''negative lookbehind assertion'' '''(?<!b)a+''' matches with one or more a's that are not preceded by b
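
To see what ends up inside the match for each assertion, here is a small sketch using Python's standard <code>re</code> module as a stand-in (it supports the same four assertion forms). Note how the negative assertions make the match shrink or shift rather than fail; <code>found_text</code> is a name made up for the illustration.

```python
import re

def found_text(pattern, text):
    """Return the matched text, or None when there is no match (hypothetical helper)."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Positive lookahead: b must follow, but is not part of the match.
print(found_text(r"a+(?=b)", "aaab"))
# Negative lookahead: the run of a's is cut short so that b does not follow it.
print(found_text(r"a+(?!b)", "aaab"))
# Positive lookbehind: b must precede, but is not part of the match.
print(found_text(r"(?<=b)a+", "baaa"))
# Negative lookbehind: the match starts at the first a not preceded by b.
print(found_text(r"(?<!b)a+", "baaa"))
```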

====Matching====
'''Matching''' means finding a part of the student's answer (or the whole answer) that fits the regular expression. This part is called a '''match'''.

You should enter regular expressions as '''answers''' to the question without modifiers or enclosing characters (modifiers will be added for you by the question - '''u''' is always added, and '''i''' in case-insensitive mode). You should also enter one correct response (that matches at least one regular expression with a 100% grade) to be shown to the student as the '''correct answer'''. The question will try all regular expressions in order to find the first full match (full for the expression, but not necessarily covering the whole response - see [[#Anchoring|anchoring]]) and take the grade from it. If there is no full match and the engine supports partial matching (see [[#Hinting|hinting]]), then the partial match that is shortest to complete will be chosen (for displaying a hint; a zero grade is given) - or the longest one, if the engine can't tell which one would be shortest to complete.
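
The "first full match wins" rule can be sketched in a few lines. This is not the question type's actual code: it is a minimal illustration in Python, with the standard <code>re</code> module standing in for the matching engine, and the function name, the <code>(pattern, grade)</code> pairs and the sample answers are all assumptions made for the example.

```python
import re

# Hypothetical answers in the order the teacher entered them: (pattern, grade).
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
]

def grade_response(answers, response, case_insensitive=False):
    """Return the grade of the first expression that matches the response."""
    flags = re.IGNORECASE if case_insensitive else 0
    for pattern, grade in answers:
        if re.search(pattern, response, flags):
            return grade
    return 0.0  # no expression matched

print(grade_response(ANSWERS, "they are blue, white and red"))
print(grade_response(ANSWERS, "they are white, red"))
print(grade_response(ANSWERS, "no idea"))
```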

====Anchoring====
Anchoring sets restrictions on the matching process:
* if a regular expression starts with '''^''' the match should start at the start of the student's response;
* if a regular expression ends with '''$''' the match should end at the end of the student's response;
* otherwise the regular expression match can be contained anywhere inside the student's response.

If you set the '''exact matching''' option to yes (the default), the question will add ^ and $ to each regular expression for you. However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, while using manually anchored expressions for grading.
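
The effect of anchoring can be demonstrated as follows. This is a sketch in Python using the standard <code>re</code> module as a stand-in. How the question adds ^ and $ internally is not described here, so wrapping the expression in (?: ) inside <code>add_anchors</code> is an assumption of this example; it is there because concatenation has precedence over alternative, so a bare '''^ab|cd$''' would anchor each alternative only on one side.

```python
import re

def add_anchors(pattern):
    """Anchor a whole expression at both ends (hypothetical helper)."""
    return "^(?:" + pattern + ")$"

def found(pattern, response):
    """Return True when the pattern matches somewhere in the response."""
    return re.search(pattern, response) is not None

response = "they are blue, white and red"
print(found("are blue", response))    # unanchored: may match anywhere
print(found("^are blue", response))   # must start at the start of the response
print(found("and red$", response))    # must end at the end of the response

# Without the grouping, ^ binds only to ab and $ only to cd.
print(found("^ab|cd$", "abx"), found(add_anchors("ab|cd"), "abx"))
```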

===Hinting===
Some matching engines can support hinting (not an easy thing to do in PHP at all) in adaptive mode.

Hinting starts with '''partial matching'''. When a student enters a partially correct answer, partial matching can find that the response starts to match and then breaks off at some character. Suppose you enter the expression:
'''are blue, white(,| and) red'''
and student answered
they are blue, vhite and red
Partial matching will find that the partial match is
 are blue,
Remember, the regular expression is unanchored, so the match doesn't have to start at the start of the student's response. When using just partial matching, the student will be shown the correct and incorrect parts:
 they are blue, vhite and red
When hinting is available, the student will have a '''hint''' button; pressing it gives a hint with the next correct character, highlighted by background coloring:
 they are blue, wvhite and red
You should typically set the hint '''penalty''' higher than the usual question '''penalty''', because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.
The Preg question doesn't add the hint character to the student's response (as the regex question does), showing it separately instead, for a number of reasons:
# it is the student's responsibility to decide whether to add the hinted character to the response (and possibly some more);
# it slightly encourages thinking about the hint, since when the response is modified automatically it is too easy to press '''hint''' repeatedly, which is usually not desirable behaviour.
When possible (if the question engine supports it), hinting chooses a character that leads to the shortest path to complete a match. Consider this response to the previous regular expression:
are blue, white; red
There are two possible hint characters: ',' or ' '. The question will choose ',' since it leads to the shortest path to complete a match, while ' ' leads to a path 3 characters longer.
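
The choice of the hint character can be illustrated with a deliberately simplified sketch. It does not work on regular expressions at all: it assumes the expression has been expanded by hand into the finite list of strings it can match (two strings for the example above), and then looks for the partial match that needs the fewest characters to complete. The function name and the expansion are assumptions of this Python example, not the engine's algorithm.

```python
# The two strings matched by: are blue, white(,| and) red (expanded by hand).
CANDIDATES = ["are blue, white, red", "are blue, white and red"]

def best_hint(candidates, response):
    """Return (partial_match, hint_character) for the match shortest to complete."""
    best = None  # (characters still missing, -matched length, matched text, hint)
    for cand in candidates:
        for start in range(len(response)):
            n = 0  # length of the common run starting at this position
            while (n < len(cand) and start + n < len(response)
                   and cand[n] == response[start + n]):
                n += 1
            if n == 0:
                continue
            hint = cand[n] if n < len(cand) else ""
            key = (len(cand) - n, -n, cand[:n], hint)
            if best is None or key[:2] < best[:2]:
                best = key
    return ("", "") if best is None else (best[2], best[3])

print(best_hint(CANDIDATES, "they are blue, vhite and red"))
print(best_hint(CANDIDATES, "are blue, white; red"))
```

For the second response both candidates match the same 15 characters, but completing the first one needs 5 more characters and the second one 8, so the comma is hinted.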

It is possible that not all regular expressions give a 100% grade. Suppose you add an expression for students with a bad memory:
'''are white(,| and) red'''
with a 60% grade and feedback about forgetting ''blue''. You may not want hinting to lead the student to the response
are white, red
if he entered
are white, oh I forgot other colors.
'''Hint grade border''' controls this. Only regular expressions with a grade greater than or equal to the hint grade border are used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade are used for hinting; if you set it to 0,5, regular expressions with 50%-100% grades are used and those with 0%-49% are not. Regular expressions not used for hinting work only when they have a full match in the student's response.
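
In other words, the hint grade border is a simple filter over the answers. A minimal sketch in Python, where the function name and the <code>(pattern, grade)</code> representation with grades as fractions are assumptions made for the example:

```python
# Hypothetical answers: (pattern, grade as a fraction of the full mark).
ANSWERS = [
    (r"are blue, white(,| and) red", 1.0),
    (r"are white(,| and) red", 0.6),
    (r"are red", 0.2),
]

def answers_for_hinting(answers, hint_grade_border):
    """Keep only the answers whose grade reaches the hint grade border."""
    return [pattern for pattern, grade in answers if grade >= hint_grade_border]

print(answers_for_hinting(ANSWERS, 1.0))   # only the 100% expression
print(answers_for_hinting(ANSWERS, 0.5))   # the 100% and 60% expressions
```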

===Subpattern capturing and feedback===
Any pair of round brackets in a regular expression is considered a '''subpattern''', and when matching, an engine that supports subpatterns remembers ('''captures''') not only the whole match, but also its parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), then only the last match of all repeats is captured. If you want to change the order of evaluation without defining a subpattern to capture (which speeds up processing), you should use (?: ) instead of just ( ). Assertions don't create subpatterns.

Subpatterns are counted from left to right by their opening brackets. Precisely, '''0''' is the whole match, '''1''' is the first subpattern etc. You can insert them in the ''answer's feedback'' using simple placeholders: '''{$0}''' is replaced by the whole match, '''{$1}''' by the first subpattern's value etc. That can improve the quality of your feedback. Placeholders won't work in the ''general feedback'' because different answers can have different numbers of subpatterns.

The '''PHP preg engine''' supports full subpattern capturing. The '''DFA''' engine can't do it, so you can use only the {$0} placeholder when working with the DFA engine.

Let's look at a regex defining a decimal number with an optional integral part:
 [+\-]?([0-9]+)?\.([0-9]+)
It has two subpatterns: the first captures the integral part, the second the fractional part of the number.
Suppose you wrote the feedback:
 The number is: {$0} Integral part is {$1} and fractional part is {$2}
Then, on entering
 123.34
the student will see
 The number is: 123.34 Integral part is 123 and fractional part is 34
If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" under those circumstances - the placeholder syntax could become complex and error-prone.
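
The placeholder substitution described above can be sketched like this. It is an illustration in Python, with the standard <code>re</code> module standing in for a capturing engine; the function name <code>fill_feedback</code> and the decision to leave unknown placeholders untouched are assumptions of the example, not the question's actual behaviour.

```python
import re

PLACEHOLDER = re.compile(r"\{\$(\d+)\}")  # matches {$0}, {$1}, ...

def fill_feedback(pattern, response, feedback):
    """Replace {$n} placeholders with captured subpatterns; None if no match."""
    match = re.search(pattern, response)
    if match is None:
        return None

    def substitute(placeholder):
        index = int(placeholder.group(1))
        if index > match.re.groups:      # no such subpattern: leave text as is
            return placeholder.group(0)
        return match.group(index) or ""  # unmatched subpattern gives empty string

    return PLACEHOLDER.sub(substitute, feedback)

NUMBER = r"[+\-]?([0-9]+)?\.([0-9]+)"
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"
print(fill_feedback(NUMBER, "123.34", FEEDBACK))
print(fill_feedback(NUMBER, ".5", FEEDBACK))
```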

===Error reporting===
The native PHP preg extension functions only report whether there is an error in the regular expression or not, so the '''PHP preg extension''' engine can't tell you much about what the error is.

The '''DFA''' engine uses a custom '''regular expression parser''', so it supports advanced error reporting. Several classes of potential errors are reported:
* unclosed square brackets of character class;
* unclosed opening parenthesis of any sort (different forms of subpatterns and assertions);
* unopened closing parenthesis;
* empty parenthesis of any sort (different forms of subpatterns and assertions);
* quantifiers without operand, i.e. at the start of (sub)expression with nothing to repeat;
* three or more top-level alternatives in the conditional subpattern.
PCRE (and the preg functions) treat most of these as '''non-errors''', which makes the meaning of many characters context-dependent. For example, the quantifier {2,4} placed at the start of a regular expression loses its meaning as a quantifier and is treated as a five-character sequence instead (that matches with {2,4}). However, such syntax is very error-prone and makes writing regular expressions harder.

For now I vote for reporting errors instead of treating them as literals, even if it means incompatibility with PCRE/preg. If you are for or against this decision, please write your position and reasons in the page comments. It may be best to have two modes, but this literally means two parsers and is outside the current scope of development. There are more pressing issues ahead.
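
To give a feel for what descriptive error reporting looks like from the teacher's side, here is a sketch using Python's standard <code>re</code> module, whose parser also reports such problems with a message. It is a different parser from both PCRE and the DFA engine's one, so the set of reported errors and the wording of the messages are Python's, not the question's; <code>regex_error</code> is a hypothetical helper.

```python
import re

def regex_error(pattern):
    """Return the parser's error message, or None when the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

for pattern in ["[ab", "(ab", "ab)", "*a", "ab"]:
    print(repr(pattern), "->", regex_error(pattern))
```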

===Matching engines===
Matching engines are different pieces of program code that do the matching (either by different methods or written by different people). There is no single 'best' matching engine - it depends on the features you want to use and on the regular expressions the engine should handle. The engines have different degrees of stability and offer different features.
====PHP preg extension====
It is based on the native PHP preg functions (which are in turn based on the PCRE library). It supports 100% of Perl-compatible regular expression features and is very stable and thoroughly tested. Sadly, the PHP functions don't support partial matching (while PCRE can), so (unless we storm the PHP developers to add support for partial matching) there is '''no hinting''' there. However, it supports subpattern capturing. Choose it when you need complex regexp features that other engines don't support, subpattern capturing or better performance.

====Deterministic finite state automata (DFA)====
This is custom PHP code using the DFA matching algorithm. It is heavily unit-tested, but considered beta-quality for now. Not all operands and operators of the PHP preg functions are supported, and for some (more exotic) ones the behaviour may still differ from the standard (especially for non-Latin characters). On the bright side, it supports '''hinting'''.

Currently supported operands (more will be added):
* single characters
* escaped special characters
* character classes, including ranges and negative classes
* escape sequences \w\W\s\S\t\d\D (WARNING: they probably will not work for Unicode (non-latin) characters for now)
* octal and hexadecimal character codes preceded by \o and \x
* meta-character . (any character)

Currently supported operators (more will be added):
* concatenation
* alternative |
* quantifiers * + ? {2,3} {2,} {,2} {2}
* positive lookahead assertions
* changing operator precedence ( ) (without subpattern capturing) or (?: )

Features that cannot be supported by DFA matching at all:
* subpattern capturing
* backreferences
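
A toy example may help to see why a DFA lends itself to partial matching and hinting: when matching stops, the automaton is in a known state, and the transitions leaving that state list the characters that could continue the match. The sketch below is in Python and uses a hand-built automaton for the expression '''ab*c'''; the table and the function are assumptions made for the illustration and have nothing to do with the engine's real PHP code.

```python
# Hypothetical hand-built DFA for ab*c: state -> {character: next state}.
TRANSITIONS = {
    0: {"a": 1},
    1: {"b": 1, "c": 2},
    2: {},
}
ACCEPTING = {2}

def run_dfa(text):
    """Return (matched length, full match?, characters that could come next)."""
    state = 0
    matched = 0
    for ch in text:
        if ch not in TRANSITIONS[state]:
            break  # the match broke at this character
        state = TRANSITIONS[state][ch]
        matched += 1
    full = state in ACCEPTING and matched == len(text)
    return matched, full, sorted(TRANSITIONS[state])

print(run_dfa("abbc"))
print(run_dfa("abbx"))
```

The automaton has no memory beyond its current state, which is also why subpattern capturing and backreferences are out of its reach.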

===Development plans===
There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.
* Update the DFA matching engine to support all operations that the DFA algorithm can
* Improve Unicode support of custom matching engines
* Add automatic generation of shortest possible correct answer in user-readable form
* Add negative matching answers (AKA "missing words" from Joseph Rezeau's question) - depends on developing and checking in the extra_answer_fields() code
* Add generation of a 'description' for a regular expression to facilitate its editing
* Develop NFA and backtracking matching engines
* Develop more help and examples for people who don't know much about regular expressions.

[[Category:Contributed code]]

Preg question type

2010-09-18T20:27:36Z

Oasychev:

Preg question type

2010-09-18T20:26:35Z

Oasychev: /* PHP preg extension */

Preg question type

2010-09-18T20:25:34Z

Oasychev: /* Development plans */

Preg question type

2010-09-18T20:24:27Z

Oasychev: /* Hinting */