Note: You are currently viewing documentation for Moodle 3.1. Up-to-date documentation for the latest stable version of Moodle is probably available here: Preg question type.

Preg question type

Revision as of 18:39, 30 August 2013


Preg is a question type that uses regular expressions (regexes) to check students' responses (though you can use it without regexes for its hinting features). Regular expressions give vast capabilities and flexibility both to teachers when making questions and to students when writing answers to them. The first section should guide you in using these docs; please use it with discretion. More details about regex syntax can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm. There are many good regex manuals, so I'm not going to repeat them here.

Authors:

  1. Idea, design, question type and behaviours code, hinting and error reporting - Oleg Sychev.
  2. Regex parsing, NFA regex matching engine, testing of the matchers, backup&restore and unicode support - Valeriy Streltsov.
  3. DFA regex matching engine - Dmitriy Kolesov.
  4. Explaining graph (authoring tool) - Vladimir Ivanov.
  5. Explaining tree, regular expression testing (authoring tools) - Grigory Terekhov.
  6. Regex description (authoring tool) - Dmitriy Pahomov.

We would gladly accept testers and contributors (see the development plans section) - there is still more work to be done than we have time for. Thanks to Joseph Rezeau for being a devoted tester of Preg question type releases and the original author of many ideas that have been implemented in Preg question type. You, too, could always help us a lot - regardless of the way you use Preg and your capabilities.


Ways to use Preg questions and these docs

I don't (want to) know anything about regular expressions but next word (character) hinting seems useful

Then you can use Preg question type just as Shortanswer with advanced hinting, without any knowledge about regular expressions. To do this, you need to choose

  • Notation => Moodle shortanswer
  • Engine => Non-deterministic finite state automata
  • Exact matching => Yes

After that, you can just copy answers from your shortanswer questions. You may want to read the section about hinting to understand more about hinting settings.

I have a vague knowledge of regular expressions, but want to use pattern matching

If writing regular expressions is hard for you, but you want to use their strength as patterns, authoring tools may help you a lot to create your questions. The tools show you the meaning of your regex in different ways: internal structure of the expression (syntax tree), visual path of matching (explaining graph) and a text description. They also allow you to test your regex against several strings and see if it works as expected. Experiment and play with your regexes, see corresponding changes in the authoring tools, and eventually you'll get the regex you want.

Read the section on authoring tools, then (probably after some experimenting with the tools on your own) the start of the section about understanding regular expressions (this is optional, but may be interesting and help a lot). You should also read the section on how Preg questions work to better understand the various settings and how they affect your questions.

I can make some effort to learn regular expressions well and be able to do anything they allow

Well, you don't know regexes but want to understand them and create complex expressions easily. Then, instead of blind trial and error, you had better spend some time and effort reading and understanding the section on understanding regular expressions. Then read briefly about authoring tools and use them to experiment with creating regexes. With these tools you can see whether you really understand regexes and whether they behave as expected. The syntax tree may be especially useful when you try to get the right meaning of precedence and arity. After you understand the principles of regexes well, read the sections on how Preg questions work and the regular expressions reference (to know your possibilities; don't bother to understand or remember them all - just look there periodically for something new to learn). Now you should be able to write regexes without much use of authoring tools, except the testing tool to test your expressions.

I know regular expressions well enough to write them on my own without further guidance

You should read about question working to understand various settings and question behaviour under them. You also may be interested in regex testing in the authoring tools section. Finally, regular expression reference may be of some use for you.


How Preg questions work

Basically, this question type is an extended version of Shortanswer. It extends its features in several different ways (you could use them in almost any combination):

  • Pattern matching - using regular expressions you can create powerful patterns describing possible students' answers
  • Hinting - when students are stuck on the question, you may allow them to ask for the next correct word (lexem) or character (with a possible penalty)

Settings affecting question work

The case sensitivity setting applies to all regular expressions you specify as answers. Note that you can also set the case sensitivity for regex parts.

Exact matching affects the question in the following way:

Yes
the entire student's response, from the first to the last letter, should match your regular expression
No
student's response can just contain a part that matches your regex: for example, if the correct answer is "whole" then "the whole string" will be a correct student response

You still can set some of your regexes to match the whole student's response using special regex syntax.

Notations specify the "language" of your answers.

Regular expression
the usual notation for regular expressions. More precisely, it is the Perl-compatible regex dialect. You may write a regex on multiple lines for readability - line breaks will be ignored.
Regular expression (extended)
useful for really complex regexes. It is similar to the PHP 'x' modifier. It ignores any unescaped whitespace in your regexes that is not inside a character class (use \s to match whitespace instead), so you may freely format your regexes with spaces. It also ignores line breaks, with one useful exception: everything from a '#' character until the end of the line is treated as a comment (the # should not be escaped and should not be inside a character class).
Moodle shortanswer
use it to avoid regex syntax altogether: just copy answers from your shortanswer questions. The '*' wildcard is supported. By choosing the NFA engine you get access to the hinting features. You can skip everything said about regexes here, but be sure to read the hinting section to understand the various settings you can alter to configure your question's hinting behaviour.
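
The effect of the extended notation can be tried outside Moodle. The sketch below uses Python's re.VERBOSE flag, which behaves much like the PHP 'x' modifier. Using Python instead of Preg itself, and the names COMPACT and EXTENDED, are assumptions made only for illustration; Preg's own handling may differ in details.

```python
import re

# The same regex written twice: once compactly, once in a free-form layout.
COMPACT = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")

# With re.VERBOSE, unescaped whitespace outside character classes is ignored
# and everything from '#' to the end of the line is a comment.
EXTENDED = re.compile(r"""
    [+\-]?      # optional sign
    ([0-9]+)?   # optional integral part
    \.          # a literal dot
    ([0-9]+)    # fractional part
""", re.VERBOSE)

for text in ["-12.5", ".75", "12"]:
    compact = COMPACT.fullmatch(text)
    extended = EXTENDED.fullmatch(text)
    # Both forms accept or reject exactly the same strings.
    print(text, compact is not None, extended is not None)
```

Both patterns should agree on every input; the layout only changes how readable the regex is.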

Matching engine specifies the program module that performs the regex matching. There is no 'best' matching engine - it depends on the features you want to use. Engines have different stability and offer different features to use.

PHP preg extension
should be used when you don't need hinting and other engines reject your expressions as too difficult, or when you encounter bugs in them. It is based on the native PHP preg_ functions. It supports 100% of Perl-compatible regex features, and it is very stable and thoroughly tested. But it doesn't support partial matching, so (unless we storm PHP developers to add support of partial matching) there is no hinting. However, it supports subpattern capturing. Choose it when you need complex regex features that other engines don't support.
Non-deterministic finite state automata (NFA)
can be used to perform hinting for your students. The NFA engine is custom PHP code; it allows many (but not all) regex features and is thoroughly tested (it passes all tests from the AT&T testregex suite and most tests from the PCRE testinput1 suite for the features it supports, which is quite a lot), but it may still contain bugs in rare cases. Features unsupported for now are lookaround assertions, recursion and conditional subpatterns.
Deterministic finite state automata (DFA)
WARNING - this engine has lacked support for the past year. Use the NFA engine instead if you can (i.e. if it doesn't reject regexes that the DFA engine accepts): it allows almost anything the DFA engine does, and it is much better tested and more stable.

Hinting

Hinting is supported by NFA and DFA engines in adaptive and interactive behaviours.

Partial matching

Hinting starts with partial matching. By a partially correct response we mean a string that starts with correct characters (matching your regex) but where the match breaks at some character. Assume you entered the regex

 "are blue, white(,| and) red"

and a student answered

 "they are blue, vhite and red"

In this situation the partial match is

 "are blue, "

Note that the regex is unanchored ("Exact match" is set to "No") so the match may not start with the first character of the student's response (like in the example above: "they " is skipped). While using just partial matching the student will see the correct and incorrect parts:

 they are blue, vhite and red
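
The idea of partial matching can be sketched in a few lines of Python. This is not how the Preg engines work internally (they run automata built from the regex); as a simplifying assumption, the regex is expanded by hand into the two strings it can match, and the names common_prefix_len, partial_match and TARGETS are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def partial_match(response, targets):
    """Longest piece of `response` that is the beginning of some target.

    The search is unanchored: the piece may start at any position of the
    response, as with "Exact matching" set to "No".
    """
    best = ""
    for start in range(len(response)):
        tail = response[start:]
        for target in targets:
            n = common_prefix_len(tail, target)
            if n > len(best):
                best = tail[:n]
    return best


# Hand-expanded strings matched by "are blue, white(,| and) red".
TARGETS = ["are blue, white, red", "are blue, white and red"]

print(repr(partial_match("they are blue, vhite and red", TARGETS)))
# prints 'are blue, '
```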

General hinting rules

Preg question type doesn't add hinted characters to the student's response (unlike the REGEXP question type); it shows them separately instead, for a number of reasons:

  1. It is the student's responsibility to decide whether to add the hinted character (and possibly some more) to the response.
  2. It slightly encourages thinking about a hint, since when the response is modified automatically it is too easy to press hint repeatedly, which is not usually desirable behaviour.

When possible, hinting chooses a character that leads to the shortest path to complete the match. Consider this response to the previous regular expression:

 are blue, white; red

There are two possible hint characters: "," or " " (leading to the " and" path). The question will choose "," because it leads to the shortest path to complete the match, while " " leads to the path 3 characters longer.
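
The choice between the two hint characters can be illustrated with a small Python sketch. It assumes exact matching from the start of the response and, for simplicity, a hand-expanded list of the strings the regex can match; the function names and TARGETS are hypothetical and are not part of Preg.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def next_char_hint(response, targets):
    """Pick the next character to hint.

    Prefer the target that shares the longest correct beginning with the
    response; among those, prefer the one with the fewest characters left.
    """
    best_key, best_char = None, None
    for target in targets:
        matched = common_prefix_len(response, target)
        if matched == len(target):
            continue  # nothing left to hint for this target
        key = (-matched, len(target) - matched)
        if best_key is None or key < best_key:
            best_key, best_char = key, target[matched]
    return best_char


# Hand-expanded strings matched by "are blue, white(,| and) red".
TARGETS = ["are blue, white, red", "are blue, white and red"]

print(repr(next_char_hint("are blue, white; red", TARGETS)))
# prints ','
```

Both targets share the beginning "are blue, white" with the response; ", red" needs 5 more characters and " and red" needs 8, so the comma wins.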

It is possible that not all regular expressions will give a 100% grade. Suppose you added an expression for students with bad memory:

 are white(,| and) red

with 60% grade and feedback about forgetting blue. You may not want hinting to lead student to the response

  are white, red

if he entered

  are white, oh I forgot the other colors.

The hint grade border controls this. Only regular expressions with a grade greater than or equal to the hint grade border will be used for partial matching and hinting. If you set the hint grade border to 1, only regular expressions with a 100% grade will be used for hinting; if you set it to 0.5, regular expressions with grades from 50% to 100% will be used for hinting and those with grades from 0% to 49% will not. Regular expressions not used for hinting work only when they fully match the student's response.
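
As a rough illustration of the rule, the following Python sketch filters a list of answers by the hint grade border. The data layout (a list of dictionaries with "regex" and "grade" keys) and the function name are assumptions for the example, not Preg's actual storage format.

```python
# Hypothetical answers of one question: the full answer and a partial one.
ANSWERS = [
    {"regex": r"are blue, white(,| and) red", "grade": 1.0},
    {"regex": r"are white(,| and) red", "grade": 0.6},
]


def hinting_answers(answers, hint_grade_border):
    """Answers whose grade is at or above the border take part in hinting."""
    return [a for a in answers if a["grade"] >= hint_grade_border]


# Border 1: only the 100% answer is used for partial matching and hints.
print([a["regex"] for a in hinting_answers(ANSWERS, 1.0)])
# Border 0.5: the 60% answer is used for hinting as well.
print([a["regex"] for a in hinting_answers(ANSWERS, 0.5)])
```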

Next character hinting

When next character hinting is available, the student will see a "hint next character" button; pressing it reveals the next correct character, highlighted by background colouring:

 they are blue, wvhite and red

You should typically set the hint penalty higher than the usual question penalty, because they are applied separately: the usual penalty for an attempt without hinting, and the hint penalty for an attempt with hinting.

Next lexem (word) hinting

Lexem means an atomic part of a language. In a natural language, a word, a number, a punctuation mark (or a group of marks like '?!' or '...') are lexems. In a programming language it can be a keyword, a variable name, a constant or an operator. Note that spaces are usually not considered lexems but separators between them, since they don't have any particular meaning.

The next lexem hint will show the student either the completion of the current lexem (if the partial match ends inside it) or the next one (if the student has completed the current lexem). For example:

  are blue

or

  are blue,

or

  are blue, white

Preg question type, since the 2.3 release, supports next lexem hinting using the formal languages block. You should choose the language in which you expect a response to your question, since lexem borders are different for different languages. For now it supports these languages (there will be more):

simple english
the English language scanner recognizes words, numbers and punctuation marks;
C/C++ language
the C (or C++) programming language;
printf language
a special language for format strings in the C/C++ programming languages; you will probably have it disabled.

The site administrator can control which languages are available to teachers, to avoid confusion. See the settings of the block "Formal languages" in the plugin settings menu.

Note that "lexem" typically isn't the word you would like your students to see on the hinting button. Each language defines its own word for it. You can enter another word in the question description if you don't like the default ones.

Subpattern capturing and feedback

Any pair of parentheses in a regex is considered a subpattern, and when matching the engine remembers (captures) not only the whole match but also the parts corresponding to all subpatterns. Subpatterns can be nested. If a subpattern is repeated (i.e. has a quantifier), only the last match of all the repeats will be captured. If you want to change the order of evaluation without defining a subpattern to capture (which will speed up processing), you should use (?: ) instead of just ( ). Lookaround assertions don't create subpatterns.

Subpatterns are counted from left to right by opening parentheses. More precisely, 0 is the whole regex, 1 is the first subpattern, etc. You can insert them in the answer's feedback using simple placeholders: {$0} will be replaced by the whole match, {$1} by the first subpattern value, etc. That can improve the quality of your feedback. Placeholders won't work in the general feedback because different answers can have different numbers of subpatterns.

PHP preg engine and NFA support full subpattern capturing. DFA engine can't do this, so you can use only {$0} placeholder when using the DFA engine.

Let's look at a regex defining a decimal number with optional integral part:

[+\-]?([0-9]+)?\.([0-9]+)

It has two subpatterns: first capturing integral part, second - fractional part of the number. If you wrote the feedback:

The number is: {$0} Integral part is {$1} and fractional part is {$2}

Then a student entered

123.34

He will see

The number is: 123.34 Integral part is 123 and fractional part is 34

If no integral part is given, {$1} will be replaced by an empty string. There is no way (for now) to erase "Integral part is" in that case - the placeholder syntax would become complex and prone to errors.
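
The placeholder substitution described above can be imitated with Python's re module. This is only a sketch of the behaviour, not Preg's implementation; the function name fill_placeholders and the constants are hypothetical.

```python
import re

NUMBER = re.compile(r"[+\-]?([0-9]+)?\.([0-9]+)")
FEEDBACK = "The number is: {$0} Integral part is {$1} and fractional part is {$2}"


def fill_placeholders(feedback, match):
    """Replace {$N} with the text captured by subpattern N of `match`."""
    def substitute(placeholder):
        value = match.group(int(placeholder.group(1)))
        # A subpattern that did not take part in the match gives an empty string.
        return value if value is not None else ""
    return re.sub(r"\{\$(\d+)\}", substitute, feedback)


print(fill_placeholders(FEEDBACK, NUMBER.fullmatch("123.34")))
# prints: The number is: 123.34 Integral part is 123 and fractional part is 34
print(fill_placeholders(FEEDBACK, NUMBER.fullmatch(".5")))
```

In the second call the integral part is absent, so {$1} is replaced by an empty string while the surrounding words stay in place.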

Looking for missing and misplaced things

Joseph Rezeau's REGEXP question type has a missing words feature, allowing you to define an answer that matches when something is absent from the response (and to give appropriate feedback to the student).

A similar effect can be achieved with negative assertions combined with anchoring the start of the match. The regular expression to look for the missing word "necessary" would be

 ^(?!.*\bnecessary\b.*)

where

  • (?!.*\bnecessary\b.*) is a negative lookahead assertion that allows matching only if the word "necessary" does not occur anywhere ahead of the current point in the string;
  • ^ is an assertion too; it anchors the match to the start of the response (otherwise there would be places in the response after the word "necessary" where matching is possible even if the word is present).

If this description is difficult for you, just surround the regexp that should be missing with ^(?!.* and .*). Don't try the '--' syntax, which is specific to Joseph Rezeau's REGEXP question type!
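
The missing-word pattern can be checked quickly with Python's re module, whose lookahead syntax is the same here. The helper name lacks_necessary is hypothetical, and the unanchored re.search call stands in for a Preg answer with exact matching set to "No".

```python
import re

# The missing-word regex from above: it matches (with an empty match at the
# start) only when the word "necessary" does not occur in the response.
MISSING_NECESSARY = re.compile(r"^(?!.*\bnecessary\b.*)")


def lacks_necessary(response):
    """True when the answer for the missing word would fire."""
    return MISSING_NECESSARY.search(response) is not None


for response in ["it is needed", "it is necessary here", "an unnecessary step"]:
    print(response, "->", lacks_necessary(response))
```

The third response contains "unnecessary" but not the separate word "necessary", so the \b word boundaries make the pattern treat the word as missing.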

You can also do a rough search for misplaced words (it will actually work only if everything else is correct) using syntax like this:

  (?<!I\s)\bam\b(?!\s+victor)

This expression catches a misplaced "am" in the sentence "I am victor" by looking for an "am" that doesn't have "I" before it (the "(?<!I\s)" part, a negative lookbehind assertion) and doesn't have "victor" after it (the "(?!\s+victor)" part, a negative lookahead assertion). "\s+" allows any number of spaces between words; the lookbehind uses a single "\s" because Perl-compatible lookbehind assertions must have a fixed length. If you want to catch the first (last) word (punctuation mark, etc), place simple assertions for the start/end of the string ("^" or "$") instead of words in the related assertions. For instance, to look for a misplaced "I" you should write something like

  (?<!^)\bI\b(?!\s+am)

which looks for an "I" that is not preceded by the start of the string and not followed by "am".

Note that if you have several answers to catch missing and misplaced things, only one will actually work for any given student response.

Since the Preg 2.3 release you can combine hints and catching missing words. But you should make sure that the answers that look for missing things (and others that give specific feedback) have a fraction (grade) lower than the hint grade border (see #Hinting). You don't actually want to generate hints for these answers, as they don't define a correct situation, so this is not a problem but a feature.

Authoring tools

Authoring tools are there to help you write, test and understand your regexes. For now they can show you the meaning of a written regex (and its parts) and test it. Authoring tools are activated by pressing the "edit" icon near the regex field.

authoring tools icon

There are four authoring tools available:

syntax tree
shows you the inner structure of regular expressions
explaining graph
shows you how your expression will work in a graphical way
description
formulates the meaning of your expression in English
testing tool
allows you to enter strings and see how they match your regex

Installation note and known technical issues

To have the syntax tree and explaining graph tools working, you (or your site admin) have to install Graphviz (http://www.graphviz.org/) on the server and fill in the 'pathtodot' setting of your Moodle installation at Site Administration > Server > System Paths. Graphviz is used to draw the pictures for you.

Syntax tree and explaining graph may not work correctly in old Opera versions - for some reason the images are not updated on user actions. Fortunately, there's a newer version 16 for Windows which works with authoring tools pretty well. On Linux you will have to use something else.

Regular expression area

Here you can edit your regular expression. Clicking on "Update" sends the regex to all tools - so syntax tree, explaining graph, description and testing results are updated. "Save" closes the authoring tools form and saves the regex in the main question editing form. "Cancel" closes the authoring tools form and discards all changes made there. The last, "TODO" button, helps you to trace the interrelation between the regex itself (text representation) and the other representations: you can select a regex part and see where it is located in the syntax tree and in the graph.

Syntax tree

Regular expressions, like all expressions, are trees of operators and operands. The syntax tree shows the inner structure of the expression graphically: what is inside what. It is most useful if you know how to read regular expressions or are learning to do so.

If you don't understand the concepts of operators and precedence well, it may mean little to you. But it is still useful for finding out where you need parentheses: cf. the trees for ab+ (a) and (ab)+ (b) in the picture below.

parenthesis in the structure of regex

The part of the expression you selected is shown as the dotted part of the tree.

leftmost node of the tree is selected

Explaining graph

The graph shows how the regular expression works. Its nodes are matched characters; its edges show paths through the nodes from the beginning to the end.

alternatives and concatenation

Oval nodes represent individual characters, character sequences (so that the graph isn't extremely big) or single special character classes (in which case they change line colour). Complex character classes are shown as rectangles. Simple assertions are checked between nodes, so they are written on the edges.

graph for regex ^\dabc[!,0-9]$

Dotted rectangles show you the repeated parts of your expression.

graph for regex \d*

Solid line rectangles show you subexpressions. When an expression is matched, the engine remembers which part of the string matched each subexpression. You can insert it in the feedback or use it in a backreference in the expression. If you do not need to remember subexpressions, you may use (?: ) instead of ( ) parentheses, which will speed up matching.

de)f

 TODO Green rectangle shows you selected part of expression.

Description

The description tries to formulate a sentence describing how the expression is supposed to work.

Testing tool

You can enter a set of strings there, one per line. These strings will be matched against your expression. You'll see coloured strings showing which parts of your strings matched the expression, so you can test whether it works as you expected.

TODO The strings will be saved in database, if you save regex (they will be lost if you close window with "cancel" button).

Understanding regular expressions

Understanding expressions in general

Regular expressions - like any expressions - are just a bunch of operators with their operands. Don't worry - you all learned to master arithmetic expressions in childhood, and regular ones are just as easy if you look at them from the right angle. Learn (or recall) only 4 new words and you are a master of regexes with very wide possibilities. Let's go?

Look at a simple math expression: x+y*2. There are two operators: '+' and '*'. The operands of '*' are 'y' and '2'. The operands of '+' are 'x' and the result of 'y*2'. Easy?

Thinking about that expression more deeply, we find that there is a definite order of evaluation, governed by operator precedence. The '*' takes precedence over '+', so it is evaluated first. You can change the evaluation order by using parentheses: (x+y)*2 will evaluate '+' first and multiply the result by 2. Still easy?

One more thing we should learn about operators is their arity - this is just the number of operands required. In the example above '+' and '*' are binary operators - they both take two operands. Most of arithmetic operators are binary, but the minus has also the unary (single operand) form, like in this equation: y=-x. Note that the unary and binary minuses work differently.

Now any expression is just a Lego game, where you arrange a sequence of operators with the correct number of operands for each (arity), minding their evaluation order through precedence and parentheses. Arithmetic expressions are for evaluating numbers. Regular expressions are for finding patterns in strings, so they naturally use other operands and operators - but they are governed by the same rules of precedence and arity.

Regular expressions

Regular expressions are a powerful mechanism for searching in strings using patterns. Their operands are characters or sets of characters that are allowed in a particular position: A is a regular expression that matches the single character 'A'. The operators in regular expressions define ways to combine individual characters into a pattern: sequence (the concatenation operator), alternative, and repetition (called a quantifier). Concatenation is such a simple operator that it doesn't have any character of its own - just write some characters in sequence, and they'll be concatenated. But it still has a precedence, so that the question can tell whether you wanted to repeat a single character or a sequence of them. An alternative is written as a vertical bar. There are many forms of quantifiers - the most commonly used are the question mark (repeat zero or one times), the asterisk (zero or more times) and the plus (one or more times). You may specify the minimum and maximum number of repeats in curly braces - this is a quantifier too.

The special characters that define operators should be escaped when used as operands - preceded by a backslash. Mathematical expressions never have escaping problems since their operands (numbers, variables) are constructed from different characters than operators (+,- etc), but when constructing a pattern for matching you should be able to use any character as an operand.

Character classes allow you to specify several possible characters for one place. They can be defined in many different ways: by enumeration of characters in square brackets [as3], by ranges in square brackets [a-z], by special sequences (\d means any digit, \W anything except a letter, digit or underscore, [[:alpha:]] any letter, etc). Another important type of operand is the simple assertion: simple assertions allow you to test some conditions - start of the string ^, end of the string $ or word border \b.

You can find a list and more examples of operands and operators in the reference section.

Precedence and order of evaluation

A quantifier has precedence over concatenation, and concatenation has precedence over alternative. Let's look at what this means:

  1. quantifier over concatenation means that quantifiers are executed first and will repeat only a single character if used without parentheses:
    • "many times*" matches "many time" followed by zero or more "s";
    • "(many times)*" matches "many times" zero or more times - changing the previous regex by using parentheses allows us to define a string repetition;
  2. concatenation over alternative means that you can define multi-character alternatives without parentheses (for single character alternatives it's better to use character classes, not the alternative operator):
    • "first|second|third" matches "first" or "second" or "third";
    • "(first |second |)part" matches "first part" or "second part" or just "part" - typical use of an empty alternative (note that the space is inside each alternative, so it is not required before a bare "part");
  3. quantifier over alternative means that you should use parentheses to repeat an alternative set:
    • "first|second*" matches "first" or "secon" followed by zero or more "d" like "secondddddd";
    • "(first|second)*" matches "first" or "second", repeated zero or more times in any order, like "firstsecondfirstfirst". Note that quantifiers repeat the whole alternative, not a definite selection from it, i.e.:
    • "(1|2){2}" matches "11" or "12" or "21" or "22", not just "11" or "22";
    • "1{2}|2{2}" matches "11" or "22" only.
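
The precedence rules above can be verified with a short Python script. Python's re module follows the same precedence for these operators; re.fullmatch plays the role of exact matching here, and the helper name matches and the table CHECKS are hypothetical.

```python
import re


def matches(regex, text):
    """True if the whole text matches the regex (like exact matching)."""
    return re.fullmatch(regex, text) is not None


# (regex, text, expected result)
CHECKS = [
    ("many times*", "many timesss", True),           # only "s" is repeated
    ("many times*", "many timesmany times", False),
    ("(many times)*", "many timesmany times", True),  # the whole string repeats
    ("first|second|third", "second", True),
    ("(first |second |)part", "part", True),          # empty alternative
    ("first|second*", "secondddd", True),             # only "d" is repeated
    ("first|second*", "firstfirst", False),
    ("(first|second)*", "firstsecondfirstfirst", True),
    ("(1|2){2}", "12", True),                         # any two of the alternatives
    ("1{2}|2{2}", "12", False),                       # only "11" or "22"
]

for regex, text, expected in CHECKS:
    assert matches(regex, text) == expected, (regex, text)
    print(f"{regex!r:26} {text!r:26} {expected}")
```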

The internal structure of a regular expression can be viewed well in the syntax tree (authoring tool). The operators that are executed first are placed lower on the tree (or to the right in the horizontal view); the operator that is executed last is the root of the tree. You can compare the trees and explaining graphs for the examples above in the authoring tools if this section doesn't seem clear to you. Remember that "execution" of regular expression operators means linking their operands in the string: sequential linking, alternative linking, or repeating.

Anchoring

Anchoring is used to set restrictions on the matching process by using simple assertions:

  • if a regular expression starts with the ^ the match should start at the start of the student's response;
  • if a regular expression ends with the $ the match should end at the end of the student's response;
  • otherwise a regex match can be found anywhere inside a student's response.

Note that simple assertions are concatenated with the regex and concatenation has precedence over alternative, which makes their usage slightly tricky:

  • "^start|end$" will match "start" from the start of the string or "end" at the end of it;
  • "^(start|end)$" uses parentheses to match exactly "start" or "end";
  • "^start$|^end$" is another way to get exact match (all top-level alternatives are anchored).

If you set the exact matching option to "yes" (which is the default value), the question will add ^ and $ to each regular expression for you (it will not affect subpattern usage). However, you may prefer to use some non-anchored regexes to catch common errors and give feedback, and use manually anchored expressions for grading.
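
The three anchoring variants can be compared with a small Python sketch. Here re.search is used as a stand-in for a Preg answer with exact matching set to "No", and the helper name found is hypothetical.

```python
import re


def found(regex, response):
    """True if the regex matches somewhere in the response (unanchored search)."""
    return re.search(regex, response) is not None


for response in ["start", "starting line", "the end", "end of it"]:
    print(
        f"{response!r:16}",
        found(r"^start|end$", response),    # "start" at the start OR "end" at the end
        found(r"^(start|end)$", response),  # exactly "start" or exactly "end"
        found(r"^start$|^end$", response),  # the same, each alternative anchored
    )
```

Only the response "start" passes all three checks; "starting line" and "the end" pass only the first, and "end of it" passes none.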

Regular expressions reference

Operands

Here's an incomplete list of operands that define character sets.

  1. Simple characters (with no special meaning) match themselves.
  2. Escaped special characters match corresponding special characters. Escaping means preceding special characters by the backslash "\". For example, the regex "\|" matches the string "|", the regex "a\*b\[" matches the string "a*b[". Backslash is a special character too and should be escaped: "\\" matches "\".
    • full list of characters that need escaping: \ ^ $ . [ ] | ( ) ? * + { }
    • NOTE! when you are unsure whether to escape some character, it is safe to place "\" before any character except letters and digits. Do not escape letters and digits unless you know what you are doing - they get special meaning when escaped and lose it when not.
    • If you have too many characters that need escaping in some fragment, you can use \Q ... \E sequence instead. Anything between \Q and \E is treated literally as characters:
      • "\Q^(abc)$\E." matches "^(abc)$" followed by any character - there are NO simple assertions and subpatterns;
      • "\Q^(abc)$." matches "^(abc)$." because there is no "\E" and all characters after "\Q" are treated as literals till the end of the regex.
  3. The dot meta-character (".") matches any possible character (except newline, but students can't enter it anyway); escape it as "\." if you need to match a literal dot. It loses its special meaning inside a character class.
  4. Character classes match any character defined in them. Character classes are defined by square brackets. The particular ways to define a character class are:
    • "[ab,!]" matches "a", "b", "," or "!";
    • "[a-szC-F0-9]" contains ranges (defined by a hyphen between 2 characters) "a-s", "C-F" and "0-9" mixed with the single character "z"; it matches any character from "a" to "s", "z", from "C" to "F" and from "0" to "9";
    • "[^a-z-]" starts with the "^" that means a negative character set: it matches any character except from "a" to "z" and "-" (note that the second hyphen is not placed between 2 characters, so it stands for itself);
    • "[\-\]\\]" contains escaping inside a character set: it matches "-", "]" and "\"; other characters lose their special meaning inside a character set and need not be escaped, but if you want to include "^" in a character set it shouldn't be first there;
  5. Escape sequences for common character sets (can be used both inside or outside character classes):
    • "\w" for any word character (letter, underscore or digit) and "\W" for any non-word character;
    • "\s" for any space character and "\S" for any non-space character;
    • "\d" for any digit and "\D" for any non-digit.
  6. Unicode properties are special escape sequences "\p{xx}" (positive) or "\P{xx}" (negative) for matching specific Unicode characters; they can be used both inside and outside character classes (the complete list of "xx" variations can be found at http://www.nusphere.com/kb/phpmanual/reference.pcre.pattern.syntax.htm):
    • "\p{Ll}" matches any lowercase letter;
    • "\P{Lu}" matches any character that is not an uppercase letter.
  7. POSIX character classes are used for the same purpose as Unicode properties (a complete list of them can be found on the Internet too), but may not work with non-ASCII characters. They are allowed only inside character classes:
    • "[[:alnum:]]" matches any alpha-numeric character;
    • "[[:^digit:]]" matches any non-digit character.
  8. Simple assertions are not characters but conditions to test; unlike other operands, they don't consume characters while matching (they have this meaning only outside character classes):
    • "^" matches at the start of the string, fails otherwise;
    • "$" matches at the end of the string, fails otherwise;
    • "\b" matches on a word boundary, i.e. either between word (\w) and non-word (\W) characters, or at the start (end) of the string if it starts (ends) with a word character;
    • "\B" matches not on a word boundary, negative to "\b".
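
A few of the operands above can be tried out in a short sketch. It uses Python's standard re module, not the Preg engines, so it is only an approximation: Python has no "\Q ... \E", and re.escape() is used here as a stand-in for it. The helper name matches_whole is made up for this illustration.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Assumption: re.escape() plays the role of \Q...\E, which Python's re lacks.
literal = re.escape("^(abc)$")
print(matches_whole(literal + ".", "^(abc)$!"))   # True: literal text, then any character

# Character classes: ranges mixed with a single character, and a negative set.
print(matches_whole(r"[a-szC-F0-9]", "z"))        # True: "z" is listed on its own
print(matches_whole(r"[a-szC-F0-9]", "t"))        # False: "t" is outside "a-s"
print(matches_whole(r"[^a-z-]", "-"))             # False: the trailing hyphen is excluded too

# The simple assertion \b tests a position and consumes no characters.
words = re.findall(r"\bcat\b", "cat concat cat!")
print(words)                                      # ['cat', 'cat']
```

Matching against the whole string mirrors how a student's complete response would be checked when exact matching is wanted.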

Still, a pattern that matches only one character isn't very useful. So here come the operators that allow us to define an expression that matches strings of several characters.

Operators

Here's a list of the common regex operators:

  1. Concatenation - a binary operator so simple that it doesn't require any special character to be defined. It is still an operator and has its precedence, which is important if you want to understand where to use parentheses. Concatenation allows you to write several operands in sequence:
    • "ab" matches "ab";
    • "a[0-9]" matches "a" followed by any digit, for example, "a5".
  2. Alternative - a binary operator that lets you define a set of alternatives:
    • "a|b" matches "a" or "b";
    • "ab|cd|de" matches "ab" or "cd" or "de";
    • "ab|cd|" matches "ab" or "cd" or emptiness (useful as a part in more complex expressions);
    • "(aa|bb)c" matches "aac" or "bbc" - using parentheses to outline alternative set;
    • "(aa|bb|)c" matches "aac" or "bbc" or "c" - typical usage of the emptiness;
  3. Quantifiers - unary operators that let you define repetition of their operand:
    • "x*" matches "x" zero or more times;
    • "x+" matches "x" one or more times;
    • "x?" matches "x" zero or one times;
    • "x{2,4}" matches "x" from 2 to 4 times;
    • "x{2,}" matches "x" two or more times;
    • "x{,2}" matches "x" from 0 to 2 times;
    • "x{2}" matches "x" exactly 2 times;
    • "(ab)*" matches "ab" zero or more times, i.e. if you want to use a quantifier on more than one character, you should use parentheses;
    • "(a|b){2}" matches "aa" or "ab" or "ba" or "bb", i.e. it is a repeated alternative, not a repetition of "a" or "b".
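
The operators above can be checked with a small sketch. It runs on Python's re module rather than on Preg itself, and matches_whole is a name invented for this example. It also shows why precedence matters: concatenation binds tighter than alternative, so parentheses are needed to limit an alternative to part of the expression.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Alternative with an empty branch: "c" alone is accepted.
accepted = [s for s in ["aac", "bbc", "c", "abc"] if matches_whole(r"(aa|bb|)c", s)]
print(accepted)   # ['aac', 'bbc', 'c']

# A quantifier applied to a parenthesized group repeats the whole group.
repeated = [s for s in ["", "ab", "abab", "aba"] if matches_whole(r"(ab)*", s)]
print(repeated)   # ['', 'ab', 'abab']

# A repeated alternative: each repetition may pick a different branch.
pairs = [s for s in ["aa", "ab", "ba", "bb", "a", "abb"] if matches_whole(r"(a|b){2}", s)]
print(pairs)      # ['aa', 'ab', 'ba', 'bb']

# Precedence: "ab|cd" means "ab" or "cd", not "a", then "b" or "c", then "d".
print(matches_whole(r"ab|cd", "acd"))     # False
print(matches_whole(r"a(b|c)d", "acd"))   # True
```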

Subpatterns and backreferences

Subpatterns are operators that remember substrings captured by the regex. The simplest way to define a subpattern is to use parentheses: the regex "a(bc)d" contains a subpattern "bc". Subpatterns are numbered by their opening parentheses, with number 0 reserved for the whole regex. That "(bc)" subpattern is the 1st. If we write, say, "a(b(c)(d))e", there are subpatterns "bcd" which is 1st, "c" which is 2nd and "d" which is 3rd. Subpatterns are usually used with backreferences, which have numbers too. Backreferences are operands that match the same strings that were matched by the subpatterns with the same numbers. The simplest syntax for a backreference is a backslash followed by a number: "\1" means a backreference to the 1st subpattern. The regular expression "([ab])\1" matches the strings "aa" and "bb", but neither "ab" nor "ba", because the backreference should match the same character as the subpattern did. Consider a little example: declaration and initialization of an integer variable in the C programming language:

  • "int ([_\w][_\w\d]*); \1 = -?\d+;" matches, for example, "int _var; _var = -10;". Of course, there can be any number of spaces between "int", the variable name etc., so a more correct regex will look like:
  • "\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*" - this will match, say, " int var2  ; var2=123  ; ". It looks a bit frightening, but it is easier to write this regex once than to try to understand it afterwards.
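
As a rough illustration, the second regex can be tried in Python's re module, which uses the same "\1" backreference syntax. This is not the Preg engine, and the names DECLARATION and declared_name are invented for the sketch.

```python
import re

# The whitespace-tolerant regex from the text, copied verbatim.
DECLARATION = re.compile(r"\s*int\s+([_\w][_\w\d]*)\s*;\s*\1\s*=\s*-?\d+\s*;\s*")

def declared_name(answer: str):
    """Return the variable name if the whole answer matches, else None."""
    match = DECLARATION.fullmatch(answer)
    return match.group(1) if match else None

print(declared_name(" int var2  ; var2=123  ; "))  # var2
print(declared_name("int x; y = 5;"))              # None: "\1" must repeat "x"
```

The second call fails only because of the backreference: every other part of the answer fits the pattern.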

Finally, instead of just numbers, subpatterns and backreferences can have names via a little more complicated syntax:

  1. "(?<name1>...)" means a subpattern with name "name1";
  2. "(?'name2'...)" means a subpattern with name "name2";
  3. "(?P<name3>...)" means a subpattern with name "name3";
  4. "\k<name4>" means a backreference to the subpattern named "name4";
  5. "\k'name5'" means a backreference to the subpattern named "name5";
  6. "\g{name6}" means a backreference to the subpattern named "name6";
  7. "\k{name7}" means a backreference to the subpattern named "name7";
  8. "(?P=name8)" means a backreference to the subpattern named "name8".

This is very useful when you work with complicated regexes and often modify them by adding or removing subpatterns - the names stay the same.
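
A short sketch of why names survive edits, written for Python's re module. Python accepts only the "(?P<name>...)" and "(?P=name)" spellings from the list above, so the other variants are not shown; the pattern and helper names are made up for the example.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

named = r"(?P<letter>[ab])(?P=letter)"

# Inserting a new subpattern in front shifts the numbers but not the names.
numbered_shifted = r"(x?)([ab])\1"                     # "\1" now points at "(x?)"
named_shifted = r"(x?)(?P<letter>[ab])(?P=letter)"     # still points at "letter"

print(matches_whole(named, "aa"))              # True
print(matches_whole(named, "ab"))              # False
print(matches_whole(numbered_shifted, "aa"))   # False: the numbered reference broke
print(matches_whole(named_shifted, "aa"))      # True: the named reference still works
```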

Duplicate subpattern numbers and names

There is a useful syntax for combining subpatterns with alternation. If you create a group "(?|...)", then every alternative inside that group will have the same subpattern numbering. Consider the regex "(?|(a(b))|(c(d)))" - there are 2 alternatives with 2 subpatterns in each. Subpatterns "ab" and "cd" both have number 1, "b" and "d" both have number 2.

Complex assertions

Complex assertions test some part of the string without including it in the matched text, but they still affect where a match can occur:

  • positive lookahead assertion "a+(?=b)" matches one or more "a" followed by "b", without including "b" in the match;
  • negative lookahead assertion "a+(?!b)" matches one or more "a" not followed by "b";
  • positive lookbehind assertion "(?<=b)a+" matches one or more "a" preceded by "b";
  • negative lookbehind assertion "(?<!b)a+" matches one or more "a" not preceded by "b".
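
The four assertions can be compared side by side in a small sketch. It uses Python's re module, whose lookaround syntax is the same, and it searches inside a sample string (invented here) instead of matching a whole response.

```python
import re

text = "aab aac baa"

ahead = re.findall(r"a+(?=b)", text)       # "a"s that are followed by "b"
not_ahead = re.findall(r"a+(?!b)", text)   # "a"s that are not followed by "b"
behind = re.findall(r"(?<=b)a+", text)     # "a"s that are preceded by "b"
not_behind = re.findall(r"(?<!b)a+", text) # "a"s that are not preceded by "b"

print(ahead)       # ['aa']
print(not_ahead)   # ['a', 'aa', 'aa']
print(behind)      # ['aa']
print(not_behind)  # ['aa', 'aa', 'a']
```

Note the first result of the negative lookahead: in "aab" the engine gives back one "a", so that the remaining single "a" is followed by "a" and not by "b". Negative assertions often match in more places than expected for this reason.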

Local case-sensitivity modifiers

Starting from Preg 2.1 you can set case-(in)sensitivity for parts of your regular expressions by using the standard syntax of Perl-compatible regular expressions:

  • "(?i)" will turn case-sensitivity off;
  • "(?-i)" will turn case-sensitivity on.

This overrides the general case-sensitivity setting, which is chosen at the question level. So you can make some answers case-sensitive and some not, or even do this for parts of answers. For example, you can set the question to "use case" and have a 50% answer starting with "(?i)" to give a lower grade when the case doesn't match but everything else is correct.

When placed in parentheses, local modifiers work up to the closest ")". When placed on the top level (not inside parentheses), they work up to the end of the expression. For example, with case sensitivity on for the question:

  • "abc(de(?i)gh)xyz" will have the "gh" part case-insensitive;
  • "abc(de)(?i)ghxyz" will have the "ghxyz" part case-insensitive.
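
The scoping rule can be approximated in Python's re module, with one important difference: recent Python versions reject a bare "(?i)" in the middle of a pattern, so the sketch below uses the scoped form "(?i:...)" to cover the same parts. It is an assumed equivalent for illustration, not the syntax Preg expects, and the names are invented.

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (illustrative helper)."""
    return re.fullmatch(pattern, text) is not None

# Python spelling of "abc(de(?i)gh)xyz": only "gh" ignores case.
inner = r"abc(de(?i:gh))xyz"
# Python spelling of "abc(de)(?i)ghxyz": "ghxyz" ignores case.
tail = r"abc(de)(?i:ghxyz)"

print(matches_whole(inner, "abcdeGHxyz"))  # True
print(matches_whole(inner, "abcdeghXYZ"))  # False: "xyz" is still case-sensitive
print(matches_whole(tail, "abcdeGHXYZ"))   # True
print(matches_whole(tail, "ABCdeghxyz"))   # False: "abc" is still case-sensitive
```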

Error reporting

Native PHP preg extension functions only report whether there is an error in a regular expression or not, so the PHP preg extension engine can't tell you much about the error.

NFA and DFA engines use a custom regular expression parser, so they support advanced error reporting. There are several classes of potential errors:

  • more than two top-level alternatives in a conditional subpattern "(?(?=f)first|second|third)";
  • unopened closing parenthesis "abc)";
  • unclosed opening parenthesis of any sort (subpatterns, assertions, etc) "(?:qwerty";
  • quantifier without an operand, i.e. at the start of (sub)expression with nothing to repeat "+" or "a(+)";
  • unclosed brackets of character classes "[a-fA-F\d";
  • setting and unsetting the same modifier at the same time "(?i-i)";
  • unknown Unicode properties "\p{Squirrel}";
  • unknown POSIX classes "[[:hamster:]]";
  • unknown (*...) sequence "(*QWERTY)";
  • incorrect character set range "[z-a]";
  • incorrect quantifier ranges "{5,3}";
  • \ at end of pattern "ab\";
  • \c at end of pattern "ab\c";
  • invalid escape sequence;
  • POSIX class outside of a character set "[:digit:]";
  • reference to a nonexistent subpattern "(abc)\2";
  • unknown, wrong or unsupported modifier "(?z)";
  • missing ) after comment "(?#comment";
  • missing conditional subpattern name ending;
  • missing ) after (?C;
  • missing subpattern name ending;
  • missing backreference name ending;
  • missing backreference name beginning;
  • missing ) after control sequence;
  • wrong conditional subpattern number, digits expected;
  • assertion or condition expected "(?()a|b)";
  • character code too big "\x{ffffffff}";
  • character code disallowed "\x{d800}";
  • invalid condition "(?(0)";
  • too big number in (?C...) "(?C256)";
  • two named subpatterns have the same name "(?<name>a)(?<name>b)";
  • backreference to the whole expression "abc\g{0}";
  • different subpattern names for subpatterns of the same number "(?|(?<name1>a)|(?<name2>b))";
  • subpattern name expected "(?<>abc)";
  • \c should be followed by an ASCII character "\cй";
  • \L, \l, \N{name}, \U, and \u are unsupported;
  • unrecognized character after (?<.
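
To get a feeling for what parser-level error reporting looks like, here is a sketch using Python's re module, whose parser also reports a message and a position. It is not Preg's parser: the messages, the positions and the set of detected errors will differ, and describe_error is a name invented for this example.

```python
import re

def describe_error(pattern: str):
    """Return "message at position N" for a broken pattern, or None if it compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"{exc.msg} at position {exc.pos}"
    return None

# A few of the error classes listed above, in forms Python's parser also rejects.
broken_patterns = [
    "abc)",         # unopened closing parenthesis
    "(?:qwerty",    # unclosed opening parenthesis
    "+",            # quantifier without an operand
    "[a-fA-F\\d",   # unclosed bracket of a character class
    "[z-a]",        # incorrect character set range
    "x{5,3}",       # incorrect quantifier range
    "(abc)\\2",     # reference to a nonexistent subpattern
]

for pattern in broken_patterns:
    print(repr(pattern), "->", describe_error(pattern))
```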

The ways to give back

This project is free software, so it's hard to get any feedback. You shouldn't expect to get software that ideally suits your needs without telling anyone about these needs, or without giving the authors some encouragement or easy-to-provide support. Sometimes as little as writing where you work and how you use (or what prevents you from using) the Preg question type may help a lot.

This software is considered a scientific project, so the following would be really useful and appreciated:

  • evidence that the results of our work (i.e. the Preg question type) are really useful to people and were used in a production environment;
  • cooperative work to research its effectiveness for various applications - basically you need to write about how you use Preg and run a survey among your teachers and/or students about it - but it can include co-authoring a conference thesis or a journal article;
  • cooperation in writing an article or help with publishing it in English-language journals (information and help with grants for further work is welcome too).

If you consider any way of helping, do not hesitate to write to me about it and ask any questions about details. You may receive individual help during such work too (for example, during cooperative research I may give you tips on how to improve your regexes, etc.).

I am a high school teacher, researcher and programmer who has a lot to do at his main paid job and doesn't have much free time to spend on developing this question type. If you could help me in some way, though, I may be able to spend more time and effort on it. Some examples:

  • publishing a thesis or paper describing your usage of the Preg question type that I could reference would improve the rating of the project and my rating as a researcher/developer, so please publish and let me know the reference if you feel grateful for this software;
  • if you would take on some more work and organise publishing a paper (or at least a thesis) with me as a co-author, that would help even more - please inform me immediately if you consider this;
  • if publishing is hard, you could just write to me what your organisation is and how you use Preg - that will help, and I would be able to better determine what should be done next;
  • join the testing efforts - there are many settings in the question, and regexes can be quite complex, so it's hard to do all testing by developers themselves.

Development plans

There is no definite schedule or order of development for these features - it depends on the available time and developers. Many features require complex code to achieve the results. If you want to help us with a specific feature, please contact the question type maintainer (Oleg Sychev) using http://moodle.org messaging.

  • Improve simple assertions support
  • Support for complex assertions
  • Support for regular expression recursion
  • Support for approximate matching to catch typos in answers
  • Improve a set of authoring tools to make writing regular expressions easier
  • Add more languages for next lexeme hinting
  • Develop more help and examples for the people that don't know much about regular expressions.